8:30 - 8:45am


Opening Remarks 


Christos Kozyrakis & Rumi Zahir 


8:45 — 10:15am 


Session 1 


Microprocessors 


Power Management of the Third Generation Intel Core Micro Architecture formerly codenamed Ivy Bridge | Sanjeev Jahagirdar, Intel
AMD’s “Jaguar”: A next generation low power x86 core | Jeff Rupley, AMD
proAptiv: Efficient Performance on a Fully-Synthesizable Core | Ranganathan Sudhakar, MIPS


10:45 — 12:15pm 


Session 2 


Fabrics & Interconnects


Swizzle Switch: A Self-Arbitrating High-Radix Crossbar for NoC Systems | Ronald Dreslinski, Michigan
FPGA Augmented ASICs: The Time Has Come | David Riddoch, Solarflare
SwitchX Virtual Protocol Interconnect (VPI) Switch Architecture | Diego Crupnicoff, Mellanox


1:30 - 2:30pm Keynote 1


The Surround Computer Era | Mark Papermaster, CTO, AMD


2:50 — 4:20pm Session 3 


Many Core and GPU


AMD Radeon HD7970 Graphics Core Next (GCN) Architecture | Michael Mantor, AMD
AMD “Trinity” APU | Sebastian Nussbaum, AMD
Intel® Many Integrated Core Architecture - The first Intel® Xeon Phi™ coprocessor (codename Knights Corner) | George Chrysos, Intel


4:50 - 5:50pm Session 4


Multimedia & Imaging


ADI's Revolutionary BF60x Vision Focused Digital Signal Processor System On Chip: 25 Billion Operations/Sec @ 80 mW and Zero Bandwidth | Robert Bushey, ADI
Visconti2 - A Heterogeneous Multi-Core SoC for Image-Recognition Applications | Masato Uchiyama, Toshiba


5:50 - 6:50pm Session 5 


Integration


Centip3De: A 64-Core, 3D Stacked, Near-Threshold System | Ronald Dreslinski, Michigan
FPGAs with 28Gbps Transceivers Built with Heterogeneous Stacked-Silicon Interconnects | Ephrem Wu, Suresh Ramalingam, Xilinx


8:55 - 9:05pm Keynote 2


The Future of Wireless Networking | Marcus Weldon, CTO, Alcatel-Lucent


IEEE Computer Society Technical Committee on Microprocessors and Microcomputers
8:45 - 10:15am Session 6 Technology & Scalability
Floating-Point Processing using FPGAs Michael Parker, Altera 
An IA-32 Processor with Wide Voltage Operating Range in 32nm CMOS Gregory Ruhl, Intel 
Reducing Transistor Variability For High Performance Low Power Chips Robert Rogenmoser, SuVolta 


10:45 - 12:15pm Session 7 SoC
High performance and efficient single-chip small cell base station SoC Kin-Yip Liu, Cavium
FSM™ (Femtocell Station Modem) - A highly integrated, performance driven, chipset solution for the small cell market Luca Blessent, Qualcomm
Medfield Smartphone SoC - Intel's Atom Z2460 Processor Rumi Zahir, Intel


1:30 — 2:30pm Keynote 3 Cloud Transforms IT, Big Data Transforms Business 
Pat Gelsinger, COO Infrastructure Products, EMC [now CEO, VMware]


2:50 — 4:20pm Session 8 Data Center Chips 

POWER7+™: IBM's Next Generation POWER Microprocessor Scott Taylor, IBM
The Intel® Xeon® Processor E5 Family Architecture, Power Efficiency, and Performance Jeff Gilbert, Mark Rowland, Intel
X-Gene™: 64-bit ARM CPU and SoC Gaurav Singh, Greg Favor, AMCC


4:50 - 6:20pm Session 9 Big Iron
SPARC64 X: Fujitsu’s new generation 16 core processor for the next generation UNIX servers Takumi Maruyama, Fujitsu
SPARC T5: 16-core CMT Processor with Glueless 1-Hop Scaling to 8-Sockets Sebastian Turullols, Ram Sivaramakrishnan, Oracle
IBM zNext: the 3rd Generation High Frequency Microprocessor Chip Chung-Lung (Kevin) Shum, IBM


6:20 - 6:30pm Closing Remarks


MIPS Technologies    Intel    AMD
Solarflare    Warthman Associates, Technical Writers (www.warthman.com)    Micro Magic, Inc.
THE MAGAZINE FOR COMPUTER APPLICATIONS
The Linley Group
High Performance State Retention with Power Gating
applied to CPU subsystems — design approaches and 
silicon evaluation 


David Flynn, Fellow, R&D ARM Ltd, Cambridge, UK 


Prototyping the DySER Specialization 
Architecture with OpenSPARC 


Jesse Benson, Ryan Cofell, Chris Frericks, Venkatraman 
Govindaraju, Chen-Han Ho, Zachary Marzec, Tony 
Nowatzki, Karu Sankaralingam 

University of Wisconsin-Madison 


Low Power and High Performance 3-D Multimedia 
Platform 


Po-Han Huang, Chi-Hung Lin, Hsien-Ching Hsieh, 
Huang-Lun Lin and Shing-Wu Tung 

Information and Communications Research Lab. 
Industrial Technology Research Institute 


The Model Is Not Enough: Understanding Energy 
Consumption in Mobile Devices 


James Bornholt, Australian National University, Todd 
Mytkowicz, Microsoft Research, Kathryn S. McKinley, 
Microsoft Research 


Efficient, Precise-Restartable Program Execution on 
Future Multicores 


Gagan Gupta, Srinath Sridharan, and Gurindar S. Sohi, 
University of Wisconsin-Madison 
Welcome to Hot Chips 24


Christos Kozyrakis & Rumi Zahir 


Program Committee Co-Chairs 


Program Committee 


Forest Baskett, NEA 

Pradeep Dubey, Intel 

Bob Felderman, Google 
Krisztian Flautner, ARM 
Anwar Ghuloum, NVIDIA
Christos Kozyrakis, Stanford 
Chuck Moore, AMD 

Sameer Nanavati, Qualcomm 




Don Newell 

Kunle Olukotun, Stanford 
Mitsuo Saito, Toshiba 
Alan Smith, UC Berkeley 
Guri Sohi, U. of Wisconsin 
Dean Tullsen, UCSD 

Rich Uhlig, Intel 

Fred Weber 

Rumi Zahir, Intel 


Program Statistics 


° 54 submissions 


° Each submission was reviewed by all PC members 


° 25 accepted talks 


° High-end & low-power cores, many core, graphics, server 
chips, multimedia SoCs, networking, ... 


° 5 posters 


° Poster session during morning & afternoon breaks 


Keynotes 


° Mark Papermaster, CTO, AMD


° The Surround Computing Era
° Tuesday 8/28, 1:30pm


° Marcus Weldon, CTO, Alcatel-Lucent


° The Future of Wireless Networking
° Tuesday 8/28, 8pm


° Pat Gelsinger, COO, EMC


° Cloud Transforms IT, Big Data Transforms Business
° Wednesday 8/29, 1:30pm


Tutorials 


° The Evolution of Mobile SoC Programming 


° Organized by Neil Trevett
° Khronos, ArcSoft, eyeSight, Metaio, Sensor Platforms, the 11ers 


° Die Stacking 


° Organized by Liam Madden 
* AMD, Amkor, Qualcomm, UMC, Xilinx 


Proceedings 


° For registered attendees 


° Talks, posters, and tutorials available on USB key 
° Also available online for tablet users (http://hc24.local) 


° Updated talks available online after the conference 


° Including keynote talks and videos of talks 


° Conference archives available online


° http://www.hotchips.org 


Conference Etiquette 


° Silence your cellphones during sessions 


° Questions on technical talks


° Wait until the end of the talk
° Come to the microphone & start with name and affiliation


° Stick to technical questions please
° If there is a line, ask a single question


° For speakers: during the break before your talk 


° Introduce yourself to your session chair 


° Test your slides 


Chuck Moore    John Nickolls


Enjoy Hot Chips 24 
From the Program Co-Chairs of


Hot Chips 24 


On behalf of the Program Committee, we are pleased to welcome you to the 24th
Annual Hot Chips Symposium. 


We received fifty-four (54) submissions this year that covered nearly all areas of the 
semiconductor and computing systems industry. The seventeen-member committee 
carefully reviewed all submissions and selected the top twenty-five (25) that best 
represent the breadth and depth of our field. We also selected five (5) posters that 
represent emerging trends and important work in related technical areas. As usual, the 
conference features the latest processor designs for server and portable systems, 
multimedia and graphics, networking and telecommunications chips, and FPGA devices. 
The diverse program covers designs optimized for sub-threshold voltage operation all 
the way to designs exceeding 5GHz clock frequency. We are also happy to feature two 
excellent talks and four posters from academic projects. 


Multi-core architectures and design for power efficiency remain the two most pervasive 
trends in the program. Nevertheless, specialization and heterogeneity are also emerging 
as important developments. Ten of the twenty-five talks in the program describe chips 
with multiple types of processing engines, programmable and fixed function. 
Heterogeneity is also the focus of the first tutorial that addresses the critical problem of 
software development for the heterogeneous multi-core chips in mobile devices. The 
second tutorial covers how die-stacking technology can improve latency, bandwidth, 
and system size, while preserving the benefits of heterogeneous manufacturing 
processes. Another exciting development this year is the appearance of chips that take 
established instruction sets beyond their traditional application domains. The program 
features talks on a smartphone chip based on the x86 ISA and a server chip based on 
ARM, in addition to talks on the latest designs based on the Power, SPARC, and MIPS 
instruction sets. 


For the keynotes, we selected three exciting talks from leading figures in our industry. In 
the first keynote, Mark Papermaster will cover AMD’s strategy towards heterogeneous 
systems and accelerated computing. In the second keynote, Marcus Weldon will discuss 
the future of wireless telecommunications and its implications to the semiconductor 
industry. The final keynote by Pat Gelsinger will discuss how cloud computing and big 
data are transforming the whole IT industry. 


The high quality of this year’s program is the direct result of the effort of the members 
of the program committee, all of whom worked hard to solicit, select, and improve 
presentations. We would also like to thank Liam Madden, Neil Trevett, Ralph Wittig, and
Anwar Ghuloum for putting together the tutorials. The members of the organizing 
committee worked equally hard to provide the best possible setup for a successful 
symposium, overcoming several difficulties associated with the new location. An 
incredible amount of effort has gone into organizing tasks that we all take for granted 
such as high quality proceedings, online registration, and meals. Finally, we 
acknowledge the effort of all speakers, without whom there would be no conference. 


Lastly, we would like to recognize the contributions of Chuck Moore and John Nickolls,
who passed away recently. In addition to being leading visionaries and innovators in our
field, Chuck and John were exemplary members of the Hot Chips community who
contributed greatly through multiple roles. They will be missed.


Christos Kozyrakis and Rumi Zahir 
Program Co-Chairs 

Hot Chips 24 

August 2012 


Power Management of the Third
Generation Intel Core Micro Architecture
formerly codenamed Ivy Bridge


Sanjeev Jahagirdar 
Varghese George, Inder Sodhi, Ryan Wells 


Contents 
> Power Scaling & Efficiency
> Idle Power Management
> Configurable TDP
> Clocking
> Additional Information
Intel’s Tick-Tock Philosophy


o Tock Processors
• Provide substantial microarchitecture improvement...
• ...on existing manufacturing process
o Tick Processors
• Retain existing microarchitecture, ...
• ...but utilize next generation fabrication technology to drive high volume and low product cost


o The Tock: Sandy Bridge
• Brought new ring/LLC microarchitecture
• Integrated Graphics on ring
• Integrated North Bridge (“System Agent”), including memory controller
o The Tick: Ivy Bridge
• Process lead vehicle: Intel's 22nm process node
• The Caveat:
- Some Ivy Bridge areas have substantial (tock-like) change (Graphics)
Ivy Bridge - the 1st 22 nm Core Product


[Figure: Ivy Bridge Microarchitecture]

o Leveraged from Sandy Bridge:
• Continue the 2-chip platform partition (CPU + PCH)
• Fully integrated on silicon:
- 2-4 IA Cores
- Processor Graphics, Media, Display Engine
- Integrated Memory Controller
- PCIe Controllers
- Modular On-Die Ring Interconnect
- Shared LLC between IA Cores and Graphics
• Same socket, similar packages
- Similar SKUs (TDP, die configurations)
• IVB backwards compatible with SNB
Ivy Bridge - Key New Things


o Entire chip moves to 22nm
• Higher performance/Lower power
o Instruction Set Architecture Enhancements
• Float16 / Fast FS/GS support / REP MOVSB / RDRAND
o Security Enhancements
• DRNG / SMEP
o Power Improvements
• Scalability features: Config TDP
• Average Power features: DDR power gates / PAIR
o IO/Memory
• DDR3L support
• Improved overclocking support
o Performance Improvements (Instructions/clock)

Ivy Bridge - Hot Chips 2012 


Contents


> Power Scaling & Efficiency
Power efficiency via scaling & testing


o Power scaling in 22nm process extracted in two ways
• Higher performance in IA & Graphics within a power envelope
• Lower operating voltage in System Agent and memory controller

> Power loss from discrete test points and interpolation (blue line)
> Ivy Bridge builds a quadratic model of the operating VF based on enhanced testing (red line)
> Optimal voltage at all operating points

[Figure: voltage/frequency curve with labels "Lowest operating" and "Nominal operating"]
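
The difference between the two lines on this slide can be illustrated with a small sketch. The test points below are hypothetical values chosen only for illustration, and the helper names `linear_interp` and `quadratic_fit` are not from the talk. The sketch assumes the required voltage grows convexly with frequency, so a straight line drawn between two sparse test points sits above the voltage a three-point quadratic model would pick.

```python
def linear_interp(points, f):
    """Piecewise-linear voltage between discrete (freq_ghz, volts) test points."""
    pts = sorted(points)
    for (f0, v0), (f1, v1) in zip(pts, pts[1:]):
        if f0 <= f <= f1:
            return v0 + (v1 - v0) * (f - f0) / (f1 - f0)
    raise ValueError("frequency outside the tested range")


def quadratic_fit(points):
    """Return v(f), the unique quadratic through three (freq_ghz, volts) points."""
    (x0, y0), (x1, y1), (x2, y2) = points

    def v(f):
        # Lagrange form of the interpolating polynomial.
        return (y0 * (f - x1) * (f - x2) / ((x0 - x1) * (x0 - x2))
                + y1 * (f - x0) * (f - x2) / ((x1 - x0) * (x1 - x2))
                + y2 * (f - x0) * (f - x1) / ((x2 - x0) * (x2 - x1)))

    return v


# Hypothetical test data: lowest, middle, and nominal operating points.
TEST_POINTS = [(1.0, 0.62), (2.0, 0.68), (3.0, 0.78)]

model = quadratic_fit(TEST_POINTS)
sparse = [TEST_POINTS[0], TEST_POINTS[2]]      # only two discrete test points
excess_volts = linear_interp(sparse, 2.0) - model(2.0)
```

With these made-up numbers the two-point line asks for about 20 mV more than the quadratic model at 2.0 GHz, which is the kind of margin the slide describes as lost power.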
Power efficiency via interrupt routing


o PAIR algorithm lowers power or performance impact of re-routable interrupts
• Compares power-state of all cores eligible to service interrupt
• Chooses “best core” based on optimization mode (Power vs. Performance)
• “Best core” based on the following
- Core C-states
- P-state request (turbo vs. non-turbo)
o Example: 1 core in C6 & 1 in C0
- Power bias will direct the interrupt to core in C0
- Performance bias will wake the C6 core
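
The selection rule in the example above can be sketched as follows. This is only an illustration, not Intel's implementation: the `Core` record, the `pick_core` function, and the way the turbo request is used as a tie-breaker are all assumptions. Only the C6/C0 behaviour is taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Core:
    core_id: int
    c_state: int   # 0 = C0 (running); larger numbers are deeper idle states
    turbo: bool    # True if the core's P-state request is in the turbo range


def pick_core(cores, mode):
    """Choose the interrupt target among eligible cores.

    mode is "power" or "performance" (the optimization mode on the slide).
    """
    if mode == "power":
        # Prefer a core that is already awake so sleeping cores stay gated.
        # Tie-break (assumed): avoid disturbing a core that requested turbo.
        key = lambda c: (c.c_state, c.turbo, c.core_id)
    elif mode == "performance":
        # Prefer the most idle core so running work is not interrupted.
        key = lambda c: (-c.c_state, c.turbo, c.core_id)
    else:
        raise ValueError("mode must be 'power' or 'performance'")
    return min(cores, key=key)


# The slide's example: one core in C6 and one core in C0.
cores = [Core(core_id=0, c_state=6, turbo=False),
         Core(core_id=1, c_state=0, turbo=True)]
power_choice = pick_core(cores, "power").core_id
perf_choice = pick_core(cores, "performance").core_id
```

Under power bias the sketch picks the C0 core, and under performance bias it wakes the C6 core, matching the two outcomes listed on the slide.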
Temperature effects


o Thermal sensors are located in the hot spots in the IA core and GPU core
o Inverse temperature dependence (ITD) effects more pronounced in the 22nm node
• No sensors at the cold spots
• IVB estimates the coldest point on the die based on thermal sensors to compensate for the effect
o Manufacturing test voltages at hot and cold temperatures
• PCU interpolates linearly at run time to determine the voltage
• Temperature moves slowly enough for the PCU and voltage regulator to keep up
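
The run-time step described above is a plain linear interpolation between the two manufacturing test corners. A minimal sketch, assuming hypothetical corner values (the temperatures and voltages below are invented for illustration) and a hypothetical function name `voltage_at`:

```python
def voltage_at(temp_c, cold, hot):
    """Linearly interpolate the required voltage between two test corners.

    cold and hot are (temperature_c, volts) pairs from manufacturing test.
    """
    (t_cold, v_cold), (t_hot, v_hot) = cold, hot
    # Clamp so the result never extrapolates beyond the tested corners.
    t = min(max(temp_c, t_cold), t_hot)
    return v_cold + (v_hot - v_cold) * (t - t_cold) / (t_hot - t_cold)


# Hypothetical corners: with inverse temperature dependence the cold
# corner needs the higher voltage.
COLD = (0.0, 0.95)
HOT = (100.0, 0.90)
mid_volts = voltage_at(50.0, COLD, HOT)
```

Because the estimated coldest point on the die drives the compensation, the temperature passed in would be that estimate rather than a hot-spot sensor reading.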
Ivy Bridge Power Planes


Key power planes:
- Core (Gated - Green)
- LLC (Ungated - Purple)
- SA/Display - Red
- GT - Blue
- Others (like IO, PLL etc.) - Gray
IVB Embedded Power Gate


Ivy Bridge has 3 on-die power gating areas
° Cores (Green)
• Independent gating per core
• Unified Cache
° PCIe controller (Red)
• Gating static only when no connection
° DDR (Purple)
• Gating of digital logic in the buffer applied during self-refresh mode


IREM images


[Images: (1) 1 core in turbo, other 3 cores and graphics gated; (2) typical usage of cores and graphics]
DDR I/O Power Gating


o Ivy Bridge implements on-die Embedded Power Gating (EPG) on DDR I/O
o Latency & Tradeoffs
• Latency considerations
- Enabled on entry into Package C3 and deeper (memory in Self Refresh) to deal with latency of power gate
- Additional latency of <5us for device access to memory during exit
- Conditional enabling: only if devices can tolerate the latency
- No impact to exit latency for interrupts
• Design tradeoffs
- To get around saving and restoring context, the DDR state is put on an ungated power island
• For Idle/MM07-OP, Intel expects DDR IO to be gated ~90% of the time




Low Voltage optimizations


o Small signal arrays and register files limit the lowest operating voltage and retention voltage

1. Dynamic cache sizing to achieve a lower cache Vmin
• Cache Vmin is limited by “bad cells” or defects distributed across the cache
• A smaller size cache has a lower Vmin due to fewer defects

2. PCU firmware based register file re-initialization on exit from standby states
• Allows reduction of retention voltage below the retention level of the register file

[Figure: Cache Blocks]
LLC - Dynamic Cache Shrink Feature
• Reduce LLC cache size dynamically from 8MB to 512KB to gain 30mV Vmin benefit
• LLC Expand/Shrink algorithm is developed for this purpose
• Entry/exit points were defined based on the workloads & performance

[Figure label: Max(Vmin1,2,3)]
[Figure: Vmin Benefit with Cache Size; x-axis: Cache Size (64KB, 512KB, 4MB, 8MB); annotations: 2 ways active, 16 ways active]
LLC - Dynamic Cache Shrink Feature


• LLC organized in 16 ways
• When PCU detects low activity workload
- Flushes 14 ways of the cache and puts ways to sleep
- Shrinks active ways from 16 to 2 to improve VccMin
• When PCU detects high activity
- Expands active ways back to 16 to improve cache hit rate
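
The shrink and expand transitions can be pictured as a two-state controller. The sketch below is an assumption-laden illustration: the class name `LlcSizer`, the activity metric, and the two thresholds are hypothetical, since the slides only say that entry and exit points were defined from workloads and performance. The way counts (16 and 2) come from the slide.

```python
TOTAL_WAYS = 16    # from the slide
SHRUNK_WAYS = 2    # from the slide


class LlcSizer:
    """Two-state LLC way controller with separate entry/exit thresholds."""

    def __init__(self, low=0.1, high=0.3):
        # low/high are hypothetical activity thresholds (fraction of peak).
        self.low = low
        self.high = high
        self.active_ways = TOTAL_WAYS

    def update(self, activity):
        """Observe one activity sample and return the active way count."""
        if self.active_ways == TOTAL_WAYS and activity < self.low:
            # Low activity: flush 14 ways and put them to sleep.
            self.active_ways = SHRUNK_WAYS
        elif self.active_ways == SHRUNK_WAYS and activity > self.high:
            # High activity: wake the ways to recover cache hit rate.
            self.active_ways = TOTAL_WAYS
        return self.active_ways


sizer = LlcSizer()
trace = [sizer.update(a) for a in (0.5, 0.05, 0.2, 0.4)]
```

Using different thresholds for shrinking and expanding keeps the cache from flapping between sizes when activity hovers near a single cut-off.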
Ivy Bridge average power reduction
(relative to SNB)


[Figure: average power reduction for 2+2 SKUs at 17W, 25W, and 35W; workloads include Blu-ray playback; labeled value: -20%]


Sample silicon measurements. Product results may vary.
Power reduction via new PM features and process scaling benefits
Benefits on other SKUs vary
Configurable TDP & Low Power Mode


o Configurable TDP allows multiple TDP levels within the same part
• Greater dynamic range of power/performance guaranteed by Intel
• Dynamically transition based on runtime triggers
o Low Power Mode defines lowest active operating point for the part
o Intel offers software driver implementing both features
• System designers can utilize this framework and customize to their needs
o Allow OEMs and end users to take advantage of scalability of Intel CPUs

[Figure labels include "TDP Down" and "Cool and Quiet"]
[Figure: frequency levels for Config TDP Down, TDP Nominal, and Config TDP Up, up to Max Turbo Frequency; annotations: additional guaranteed frequency, Dynamic Frequency Range, 59%]


IA/GPU Power sharing


o OEMs can configure the cooling limits to <17W
• Static biasing (X% to GPU and (100-X)% to IA) results in suboptimal performance
o Solution: Distribute power based on workload demand
• Determine target CPU/ring frequency based on workload
- If actual CPU/ring freq < (target frequency - guard band), move bias toward CPU; else move bias toward GPU
- With hysteresis
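
The bias rule above can be sketched as a small update function. Everything numeric here is hypothetical (guard band, step size, clamp limits), as is the name `update_bias`. The slide does not say how the hysteresis is realized, so the sketch assumes a hold region between the guard band and the target.

```python
def update_bias(cpu_share, actual_mhz, target_mhz,
                guard_mhz=100.0, step=0.05, lo=0.2, hi=0.8):
    """Return the new CPU share of the power budget (GPU gets the rest).

    All default values are illustrative assumptions, not product settings.
    """
    if actual_mhz < target_mhz - guard_mhz:
        # CPU/ring is falling short of workload demand: move bias toward CPU.
        cpu_share += step
    elif actual_mhz >= target_mhz:
        # CPU/ring demand is met: move bias toward GPU.
        cpu_share -= step
    # Otherwise hold: this dead band is one assumed form of hysteresis.
    return min(hi, max(lo, cpu_share))
```

For example, with a 2500 MHz target, a reading of 2000 MHz raises the CPU share, 2450 MHz leaves it unchanged, and 2500 MHz lowers it.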
[Figure: power-sharing control loop. Inputs: core MSRs (ACNT, MCNT, TSC), GPU MMIO (GPU busyness), requested power bias, CPU/ring freq; output: CPU/GPU frequencies]


Platform Power management


o Power delivery management
• How do we deal with the platform need to divert current from the CPU to other components dynamically?
- IVB PCU will manage the current draw and will honor dynamic max current updates
o Platform debug and tuning hooks
• IVB provides feedback to platform designers if power delivery & cooling is limiting performance
IVB Clock Domains


[Figure: clock domain diagram with the following labels]
Display Reference: 120 MHz (100 MHz DFx)
DCLK - 400/533/667/800
IO - 1.62/2.7 GHz
Logic - 162/270 MHz
QCLK - 0.8/1.067/1.34/1.6 GHz
IO - 2.7 GHz / 2.5 GHz (DFx)
Logic - 162/270 MHz
PCU - 1600 MHz
SA - 800 MHz
DE - 400/800 MHz
IO - 2.5/5 GHz
BCLK Reference: 100 MHz
LCLK - 250/500 MHz
IO - 2.5 GHz / 5 GHz
LCLK - 250 MHz / 500 MHz
UCLK = scalable frequency in 100 MHz steps
RCLK
200 MHz
Scalable frequency in 50 MHz steps (or) 100 MHz
PLL/Clocking


[Figure: Silicon Measurement chart; categories: Total (including ET), Global Drivers + Islands, Clock Source + Spines; legend: Wide Range SB PLL, PCIE LC PLL, Single Ratio SB PLL]
Overclocking Enhancements


Core Frequency
• Unlocked turbo limits
• Unlocked core ratios up to 63 in 100MHz increments
• Programmable voltage offset
Graphics Frequency
• Unlocked graphics turbo limits
• Unlocked graphics ratios up to 60 in 50MHz increments
• Programmable voltage offset
Memory Ratio
• Unlocked memory controller
• Granularity options for 200 and 266MHz
• Logical support up to 2666MHz
DMICLK (aka BCLK)
• Unlocked PCH clock controller (1MHz increments)
PEG and DMI
• Fixed ratios

Real-Time Overclocking


o PCU samples OC parameters continuously and updates power limits
• Changes effective immediately
o OC without reboot:
• Maximum Core Ratio
• Processor Graphics Ratio
• BCLK (small increments)
• Power Limits: PL1, PL2, Tau
• Additional Turbo Voltage for CPU and pGfx
• All controls within the OS

[Screenshot: tuning utility showing System Information, Processor, Manual Tuning, Reference Clock 100.0000 MHz, Turbo Boost Short Power Max Enable, Turbo Boost Power Time Window 32 seconds, and multipliers (1 active core 53x, 2 active cores 52x)]
Ivy Bridge ISA & Security
enhancements
Float16 Data Conversion Instructions


o New instructions for supporting conversion between a 16-bit floating point memory format and 32-bit single precision
• VCVTPH2PS, VCVTPS2PH
• Both 128 (SSE) and 256 bit (AVX) wide vector flavors supported
• Only supported in the VEX prefix context
o Facilitates use of single-precision floating point computations from a more compressed memory format
• 1-bit sign, 5-bit exponent, 10-bit significand (+ implicit integer bit)
o Enables higher dynamic range compared to fixed point within the same storage footprint
• Image processing, video decode, audio processing
• 50% reduction in storage vs. single-precision FP (w/ loss of fidelity)
o Enumerated via new CPUID feature flag
• CPUID.1.ECX[29]
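
The 16-bit layout described above (1 sign bit, 5 exponent bits, 10 significand bits) can be explored in software with Python's `struct` module, whose `e` format code is IEEE half precision. This sketch only mirrors the storage format. It does not execute VCVTPH2PS or VCVTPS2PH, and `struct` uses its own fixed rounding rather than the rounding control the instruction takes in its immediate operand. The helper names are hypothetical.

```python
import struct


def to_half_bits(x):
    """Round a float to half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def from_half_bits(bits):
    """Expand a 16-bit half-precision pattern back to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def fields(bits):
    """Split a pattern into (sign, exponent, significand) = 1 + 5 + 10 bits."""
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


one = to_half_bits(1.0)
half_bytes = struct.calcsize("<e")     # storage per half-precision value
single_bytes = struct.calcsize("<f")   # storage per single-precision value
```

The byte counts show the 50% storage reduction directly, and converting a value such as 0.1 to half precision and back shows the loss of fidelity the slide mentions.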
VCVTPH2PS - Convert 16-bit float to SP


VCVTPH2PS ymm1, xmm2/mem128 - 256 bit vector
VCVTPH2PS xmm1, xmm2/mem64 - 128 bit vector


Converts four packed 16-bit floating-point values in the low 64 bits of XMM2 or 64-bit memory location to four single-precision floating-point values and writes the results in the destination (XMM1 register).


[Figure: VCVTPH2PS xmm1, xmm2/mem64; each of the four 16-bit values in bits 63:0 of xmm2/mem64 is converted to a 32-bit value in xmm1]
VCVTPS2PH - Convert SP to 16-bit float


VCVTPS2PH xmm1/mem64, xmm2, imm8 - 128 bit vector
VCVTPS2PH xmm1/mem128, ymm2, imm8 - 256 bit vector


Converts four packed single-precision floating-point values in XMM2 to four 16-bit floating-point values and writes the results in the destination (XMM register or memory location).


[Figure: VCVTPS2PH xmm1/mem64, xmm2, imm8; each of the four 32-bit values in xmm2 is converted to a 16-bit value in xmm1/mem64]
Write/Read FS/GS Base Instructions


o New ring-3 instructions for read/write of the FS & GS segment base registers
• To be used by user level code for thread local storage
• Enumerated via new CPUID feature flag
- CPUID.7.0.EBX[0] indicates availability (leaf 7, subleaf 0)
• Requires enabling by OS to permit FS/GS segment base access
- CR4.RDWRGSFS (bit 16) = 0 (default)
o Motivation:
• Improve scalability and programming ease for user threads


REP MOVSB/STOSB improvements 


o Historically, optimizing block copy/fill operations tends to be microarchitecture specific
• Lack of a "one size fits all" solution implies CPU model specific algorithms for best performance
o IVB addresses this through more optimized REP MOVSB and REP STOSB instructions
• Expect this to replace the need for manual tuning solutions


e Limitation: If block size is known at compile time and size «264 bytes, then scalar loads & 
stores are still considered faster 


o Enhancement availability indicated by CPUID.7.0.EBX[9] (ENFSTRG)
• This bit can be used by run-time libraries for tuning to a specific implementation


Digital Random Number Generator (DRNG)


[Figure: DRNG block diagram: entropy source (2-3 Gb/s) with health tests/BIST, combined conditioner / Deterministic Random Bit Generator (DRBG) with AES engine, BIST engine, DRNG wrapper interface (embedding specific)]


o Background:
• Entropy is valuable in a variety of uses. Example: “keying material” in cryptography
• Historically, computing platforms did not have a good source of a high quality/high performance “entropy source”
• Typical sources used today are slow (bit rate in Kb/s) (key strokes, mouse clicks etc.)
o IVB introduces high quality/high performance DRNG
o The DRNG is designed to be standards compliant
• ANSI X9.82, NIST SP 800-90 and NIST FIPS 140-2/3 Level 2 certifiable entropy source
o New instruction: RDRAND - Available at all privilege levels/operating modes
• Instruction will return a random number (16, 32 or 64-bit) to the destination register
o New CPUID feature flag for RDRAND enumeration
• CPUID.1.ECX[30]
Supervisory Mode Execute Protection (SMEP)


o Background:
• Privilege Escalation Attack causes CPL 0 access to user mode pages
• Example:
- Step 1: Compromise user mode app or trick user into installing attack app
- Step 2: Exploit OS vulnerability to force control transfer to user mode attack code while CPU remains in supervisory mode => privilege escalation
o IVB introduces SMEP to help prevent such attacks
• Prevents execution of user mode pages while in supervisor mode
• If CR4.SMEP set to 1 and in supervisor mode (CPL<3), instructions may not be executed from a linear address for which the user mode flag is 1
• Available in both 32- and 64-bit operating modes
• SMEP is enumerated via CPUID.7.0.EBX[7]

[Figure: address space from 0x0..000 to 0xF..FFF; label: System call]
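
The CPUID feature flags quoted across the ISA and security slides can be collected in one place. The sketch below only decodes register values that have already been read by some other means; it does not execute CPUID. The table name, function name, and feature labels are illustrative, while the leaf, register, and bit positions are the ones given on the slides.

```python
# (leaf, register, bit) for each feature, as listed on the slides.
FEATURE_BITS = {
    "Float16":    (1, "ecx", 29),
    "RDRAND":     (1, "ecx", 30),
    "FS/GS base": (7, "ebx", 0),
    "SMEP":       (7, "ebx", 7),
    "ENFSTRG":    (7, "ebx", 9),
}


def decode_features(regs):
    """Map feature name to True/False.

    regs maps (leaf, register_name) to a 32-bit register value that the
    caller obtained from CPUID (leaf 7 values are for subleaf 0).
    """
    return {
        name: bool((regs.get((leaf, reg), 0) >> bit) & 1)
        for name, (leaf, reg, bit) in FEATURE_BITS.items()
    }


# Hypothetical register contents, for illustration only.
sample = {(1, "ecx"): (1 << 29) | (1 << 30), (7, "ebx"): 1 << 7}
features = decode_features(sample)
```

A feature being enumerated does not mean it is enabled: the FS/GS base instructions and SMEP also depend on the CR4 bits described on their slides.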
PCI Express Gen 3


Ivy Bridge PCI Express Gen 3


o Third generation of the PCI Express I/O interface
• Delivers nearly twice the I/O bandwidth vs. Gen 2
• Improves performance for applications sensitive to I/O bandwidth
• Enables smaller form factors via narrower, faster physical links
o Bandwidth realized through:
• Faster signaling speed: 8 GT/s
• More efficient lane encoding: 128/130
o Utilizes Gen 2 I/O channel characteristics
• Enables compatibility with previous Gen components
• Enables drop-in upgrade for Sandy Bridge-based platforms
o Supports PCIe bandwidth management & ASPM states
• Dynamic Link Width Configuration, L0s (Rx & Tx), L1
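
The "nearly twice" figure follows from the two changes listed above. A short worked calculation, using the Gen 3 numbers from the slide and the standard Gen 2 parameters (5 GT/s with 8b/10b encoding); the function name `lane_gbps` is just for illustration:

```python
def lane_gbps(gt_per_s, payload_bits, encoded_bits):
    """Raw per-lane data rate in Gb/s after line-encoding overhead."""
    return gt_per_s * payload_bits / encoded_bits


GEN2 = lane_gbps(5.0, 8, 10)       # 8b/10b encoding
GEN3 = lane_gbps(8.0, 128, 130)    # 128b/130b encoding
RATIO = GEN3 / GEN2
```

The signaling rate alone rises by 1.6x, and the lighter encoding overhead lifts the per-lane ratio to roughly 1.97x. Packet and protocol overheads are not modeled here.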
Ivy Bridge PCIe Performance


[Figure: Ivy Bridge PCIe Gen 3 throughput compared with PCIe Gen 2 maximum throughput for writes; x-axis: Maximum Packet Size]


o Ivy Bridge delivers nearly 2x Gen 2 bandwidth
• At similar latencies
- ~300ns typical for upstream read request


Ivy Bridge PCIe Logic Changes


• Sandy Bridge PCIe uArch unchanged
- No change to primary channel hub/TL interface
- No change to controller/PHY lanes interface
• Gen 3 changes layered on top of Gen 2 functionality (additional states, arcs)
- Parallel flows implemented where feasible
• Elasticity and De-skew buffers have both logic and size change

[Figure: PCIe controller block diagram. Labels: Primary Channel; Downstream Configuration; Transaction Steering; Tx credits; Rx credits; Receive queue; Tx queue and Replay buffer; RxAck/Nack; TLP/DLLP MUX; TLP/DLLP DEMUX; Phy packet encapsulation and lane map; Elasticity Buffer (logic change, buffer size change); 8b10b Encoder; 8b10b Decoder; Physical Layer; Tx Symbol stream to AFE; Rx Symbol stream from AFE]
IPC Improvements


Most Significant IVB IPC Improvements


o Pipeline MOV elimination
• Eliminates Move related micro-operations from the processor execution pipeline
o Pipelined divider
• Improves throughput of divide related computations
o Next page prefetcher
• Enables prefetching to span across a 4K page boundary
o Shift/Rotate performance
• Addresses glass jaw concern with crypto and hashing algorithms
• Addresses clumsiness of partial flag handling
o 6 additional split load registers
• Improves performance for loads splitting cache lines
• Especially critical for AVX or SSE
Uncore IPC Features


o AFP - Adaptive Fill Policy
• Cache heuristics to identify and segregate streaming applications
o QLRU - Quad-Age LRU algorithm
• Allows fine-grain “age assignment” on cache allocation
• E.g.: prefetched requests are allocated at “middle age”
o DPT - Dynamic Prefetch Throttling
• Real-time memory bandwidth monitor
• Directs core prefetchers to reduce prefetch aggressiveness during high memory load scenarios
o Channel Hashing - DRAM channel selection mechanism
• Allows channel selection to be made based on multiple address bits
• Historically, it had been A[6]
• Allows more even distribution of memory accesses across channels
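
The benefit of hashing multiple address bits can be seen with a strided access pattern. The sketch below assumes two channels and an XOR of address bits 6 through 13. That bit set is a hypothetical choice for illustration; the slides do not give the actual hash function.

```python
def channel_a6(addr):
    """Historical selection: channel is address bit 6 alone."""
    return (addr >> 6) & 1


def channel_hashed(addr, bits=(6, 7, 8, 9, 10, 11, 12, 13)):
    """Assumed hash: XOR of several address bits selects the channel."""
    ch = 0
    for b in bits:
        ch ^= (addr >> b) & 1
    return ch


# A 128-byte stride never toggles bit 6, so A[6] alone uses one channel.
addrs = [i * 128 for i in range(64)]
a6_hits = sum(channel_a6(a) for a in addrs)          # accesses on channel 1
hashed_hits = sum(channel_hashed(a) for a in addrs)  # accesses on channel 1
```

With this stride every access lands on channel 0 under the single-bit scheme, while the XOR hash splits the 64 accesses evenly between the two channels.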
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WiMAX connectivity requires a WiMAX enabled device and subscription to a WiMAX broadband service. WiMAX 
connectivity may require you to purchase additional software or hardware at extra cost. Availability of WiMAX is 
limited, check with your service provider for details on availability and network limitations. Broadband performance 
and results may vary due to environment factors and other variables. See www.intel.com/go/wimax_ for more 
information. 

Intel® My WiFi Technology is an optional feature and requires additional software and a Centrino® wireless 
adapter. Wi-Fi devices must be certified by the Wi-Fi Alliance for 802.11b/g/a in order to connect. See 
mywifi.intel.com for more details. 

Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT 
Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific 
hardware and software you use. For more information including details on which processors support HT 
Technology, see here 

Intel® Turbo Boost Technology requires a PC with a processor with Intel Turbo Boost Technology capability. Intel
Turbo Boost Technology performance varies depending on hardware, software and overall system
configuration. Check with your PC manufacturer on whether your system delivers Intel Turbo Boost
Technology. For more information, see http://www.intel.com/technology/turboboost

Requires an Intel® Wireless Display enabled PC, TV Adapter, and compatible television. Available on select 
Intel® Core processors. Does not support Blu-Ray or other protected content playback. Consult your PC 
manufacturer. For more information, see www.intel.com/go/wirelessdisplay 

(Built-in Visuals) Available on the 2nd gen Intel® Core™ processor family. Includes Intel® HD Graphics, Intel® 
Quick Sync Video, Intel® Clear Video HD Technology, Intel® InTru™ 3D Technology, and Intel® Advanced Vector 
Extensions. Also optionally includes Intel® Wireless Display depending on whether enabled on a given system or 
not. Whether you will receive the benefits of built-in visuals depends upon the particular design of the PC you 
choose. Consult your PC manufacturer whether built-in visuals are enabled on your system. Learn more about 
built-in visuals at http://www.intel.com/technology/visualtechnology/index.htm.

Intel® Insider™ is a hardware-based content protection mechanism. Requires a 2nd generation Intel® Core™ 
processor-based PC with built-in visuals enabled, an Internet connection, and content purchase or rental from 
qualified providers. Consult your PC manufacturer. For more information, visit www.intel.com/go/intelinsider.

Viewing Stereo 3D content requires 3D glasses and a 3D capable display. Physical risk factors may be present
when viewing 3D material.
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Security features enabled by Intel® АМТ require an enabled chipset, network hardware and software and a 
corporate network connection. Intel AMT may not be available or certain capabilities may be limited over a host 
OS-based VPN or when connecting wirelessly, on battery power, sleeping, hibernating or powered off. Setup 
requires configuration and may require scripting with the management console or further integration into existing 
security frameworks, and modifications or implementation of new business processes. For more information, see 
http://www.intel.com/technology/manage/iamt.

No system can provide absolute security under all conditions. Requires an enabled chipset, BIOS, firmware and 
software and a subscription with a capable Service Provider. Consult your system manufacturer and Service 
Provider for availability and functionality. Intel assumes no liability for lost or stolen data and/or systems or any 
other damages resulting thereof. For more information, visit http://www.intel.com/go/anti-theft

Requires an Execute Disable Bit enabled system. Check with your PC manufacturer to determine whether your
system delivers this functionality. For more information, visit http://www.intel.com/technology/xdbit/index.htm

Intel® vPro™ Technology is sophisticated and requires setup and activation. Availability of features and results
will depend upon the setup and configuration of your hardware, software and IT environment. To learn more
visit: http://www.intel.com/technology/vpro

The original equipment manufacturer must provide TPM functionality, which requires a TPM-supported
BIOS. TPM functionality must be initialized and may not be available in all countries.

Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to 
execute the instructions in the correct sequence. AES-NI is available on select Intel® processors. For availability, 
consult your reseller or system manufacturer. For more information, see
http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/

No system can provide absolute security under all conditions. Requires an Intel IPT enabled system, including a 
2nd generation Intel Core processor, enabled chipset, firmware, and software. Available only on participating 
websites. Consult your system manufacturer. Intel assumes no liability for lost or stolen data and/or systems or 
any resulting damages. For more information, visit http://www.ipt.intel.com
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AMD: 


TWO X86 CORES TUNED
FOR TARGET MARKETS


“Bulldozer Family”
Performance & Scalability
Mainstream Client and Server Markets

“Cat Family”
Flexible, Low Power & Small
Low Power Markets, Cloud Clients


|. Optimized 


Improve power efficiency thru clock 
gating and unit redesign 


- Update the ISA/Feature Set 
- Increase Process Portability 
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TI >N 
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- 40 bit physical address capable не 
- Improved virtualization Е Los 
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"JAGUAR" СОМРИТЕ UNIT (CU) 


4 Independent “Jaguar” cores
Shared Cache Unit (SCU)
- 4 L2 Data Banks (total 2MB)

[Diagram: four cores attached to the SCU L2 interface, which connects to/from the NB]
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— To/From 
*«—— — Shared Cache Unit 
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“Jaguar” Enhancements: 


= 4x32B IC loop buffer for power
= ITLB: 512 4KB pages
= Layered branch predictor w/ state of the art conditional predictor
= 32B fetch
= 2-instruction decode
= Improved IC prefetcher for IPC
= Grew IB for improved fetch/decode decoupling
= Added decode stage for frequency


— To/From 
«—— — Shared Cache Unit 


7 | “Jaguar “ | HotChips 2012 


“Jaguar” Enhancements: 


= New hardware divider (leveraged from “Llano”)
= 2 ALU
= 1 LD AGU
= 1 ST AGU
= New/improved COPs: CRC32/SSE4.2, BMI1, POPCNT, LZCNT
= More OOO resources: larger schedulers, ROB


— To/From 
*«——— Shared Cache Unit 
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“Jaguar” Enhancements: 
= 128b native hardware
  - 4 SP muls + 4 SP adds
  - 1 DP mul + 2 DP adds
= ISA: many new COPs
  - 256b AVX supported by double pumping 128b hardware
= New Zero Optimizations
= OOO scheduler
= 2 execution pipes
= Second FPRF stage for frequency


— To/From 
<4—— Shared Cache Unit 
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- L2DTLB: 512 4KB pages 
- 8-stream DC prefetcher 
= OOO LS 




— To/From 
<—- Shared Cache Unit 
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“Jaguar” Enhancements: 


= Ld/St Queues redesign: 
— Improved OOO picker 
— Improved STLF 
— Less store data shuffling 
— More OOO resources 
= Enhanced Tablewalks 
= 128b data path to FPU 
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To/From 
Shared Cache Unit 


“Jaguar” Supports: 
= 8 DC miss/prefetch 
* 8 IC miss/prefetch 


* Improved Write Combining 
with 4 WCB data buffers 


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Microarchitectural Frequency improvement over “Bobcat”: >10% 


One additional cycle branch mispredict latency vs. “Bobcat” 


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tency 


[Floorplan labels: Inst. Tag/TLB, Instruction Cache, x86 Decode, Ucode ROM, Debug, Integer, Data Tag/TLB, Load/Store, Bus Unit, FP, Data Cache, Test]
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? РІ АМ COMPARISON 


“Bobcat” core in 40nm = 4.9 mm^2: 7 core macros, 2 L2 macros, 3 clock macros
“Jaguar” core in 28nm = 3.1 mm^2: 3 core macros, 1 L2 macro, 1 clock macro
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D CACHE UNIT 


2MB, 16-way
- Supported by 4 L2D banks of 512KB each

L2 cache is inclusive, which allows using L2 tags as probe filter
- Any line in a Core L1 instruction or data cache must be in the L2
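
The probe-filter property of an inclusive L2 can be illustrated with a toy model. This is a sketch under assumptions, not AMD's implementation: the class name `InclusiveL2`, the per-line set of sharer cores, and the back-invalidation on eviction are hypothetical details added for illustration; the slide only states that inclusion lets the L2 tags filter probes.

```python
# Toy model of an inclusive shared L2 used as a probe filter.
# All names and the sharer-tracking scheme are illustrative assumptions.


class InclusiveL2:
    def __init__(self):
        # line address -> set of core ids whose L1 may hold the line
        self.l2_lines = {}

    def l1_fill(self, core: int, line: int) -> None:
        """Inclusion: a line entering any L1 is also tracked in the L2."""
        self.l2_lines.setdefault(line, set()).add(core)

    def l2_evict(self, line: int) -> set:
        """Evicting from L2 returns the cores that must drop their L1 copy."""
        return self.l2_lines.pop(line, set())

    def probe(self, line: int) -> list:
        """An L2 tag miss proves no L1 holds the line, so no core is probed."""
        return sorted(self.l2_lines.get(line, ()))


if __name__ == "__main__":
    l2 = InclusiveL2()
    l2.l1_fill(0, 0x40)
    l2.l1_fill(2, 0x40)
    print(l2.probe(0x40))  # cores that need to see the probe
    print(l2.probe(0x80))  # filtered: no L1 is disturbed
```

The design point is that a coherent probe which misses in the L2 tags can be answered without touching any core's L1.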
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rface block runs at core clock 


— L2D’s run at half clock for power, 
only clocked when required 


- New L2 stream prefetcher per core 
— Allows improved bandwidths & IPC 


= Up to a total of 24 paired read + write 
transactions in flight 


= 16 additional L2 snoop queue entries
- Allows for handling coherent probes at high bandwidth
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pronto Relative C6 Latencies Under 
Normalized Conditions 


|. the remaining active cores (IPC) 

- Last core in the compute unit to be 
power gated flushes shared L2 in 
preparation for full C6. Hardware 


flush
[Chart bars: "Bobcat" C6, "Jaguar" C6, "Jaguar" CC6]
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improved power efficiency 


ue, L2 clocks, etc. 
, including improved dynamic clock gating: 


m 0.00 91.8 
Apps 0.95 1.10 B9 
“Bobcat” Virus 1.74 1.78 84.6 
“Jaguar” Virus 0.81 1.86 85.7 


= Increased frequency capability allows choices: 
— Higher frequency -> higher performance 
— Same frequency at lower voltage -> lower power 
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98.8 


92.3 


87.1 


85.0 


“Based on internal AMD modeling using benchmark simulations 
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ont is for informational purposes only and may contain technical inaccuracies, 


subject to change and may be rendered inaccurate for many reasons, including but not 

ges, component and motherboard version changes, new model and/or product releases, product differences 
ware changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct 
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| proAptiv: Efficient Performance 
on a Fully-Synthesizable Core 


MIPS Technologies | 28 August 2012

Ranganathan “Suds” Sudhakar
Chief Architect


© 2012 MIPS Technologies, Inc. All rights reserved. 


Aptiv Family Highlights 


Three new cores optimized for embedded markets 
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Aptiv Core Portfolio 


Classic MIPS Products
= 1074K Series: MP version of the 74K
= 74K Series: Out of Order, Dual issue
= 1004K Series: MP version of the 34K
= 34K Series: Multi-threading, 9-stage pipeline
= 24K/24KE Series: 8-stage pipeline
| | Miake | 
M4K/4KE Series | 
Series 
"m Code compression 
4 stage pipeline 5 stage pipeline 
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Aptiv'" Generation ———> 


Single-Threaded 
Area Optimized 


interAptiv™ 
Family Multi-Threaded 


Power Optimized 


1 to 4 core configs, 
2-level MT/FPU and 
L2 cache controller 


microAptiv™ 


Family DSP-Accelerated 
MCU (cacheless) or Energy Optimized 
MPU (caches/TLBs) 
with real-time/security 


* Fully synthesizable “package”
  = Design data
    - RTL
    - Configurator: MP/MT, FPU, Trace/Debug, cache/TLB/SPRAM/buffer sizes, bus widths
  = Physical design support
    - Reference floorplans, Synthesis + Place-and-Route scripts
    - DFT/Scan, Timing and Power Analysis scripts
  = Simulation models
    - Bus Functional Models and compliance checkers
    - Instruction accurate simulators, Cycle exact simulators
  = Verification collateral
    - Architectural Verification Test suites, core diagnostics
    - Sample testbench, build and run scripts
  = Documentation
    - ISA manuals, global configuration register tables, memory maps, boot procedures
    - Implementer's Guide, Integrator's Guide, Hardware/Software User manuals

* Available separately
  = FPGA development boards
  = EJTAG/debug probes
  = OS components, libraries, software toolchains (compiler, libraries, JITs, codecs)
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TECHNOLOGIES 


* Tapeout-ready GDS, built on a generic ASIC flow, using:
  = Configured soft core
  = Floorplan: placement of RAMs, bounding box
  = Physical IP for some process technology
    - Standard cell library
      e.g. 28nm low-leakage 12-track mixed-Vt with booster flops
    - Compiled memories
      e.g. 28nm high-speed LVT single + dual-port bit-writable memories
    - Fab conditions
      Process corner (usually worst-case slow-slow, high-temp, low voltage)
      Number of metal layers, DRC/LVS, power grid, IR drop, OCV/AOCV, PLL jitter

* Not to be confused with a "hard core"

* Frequency and power improvements beyond simple hardening:
  = Custom std cells, flops, clk-gaters characterized for typical silicon
    e.g. 1.x GHz worst-case SVT -> 2.x GHz typical with LVT, overdrive, cooling
  = Multi-port register files and custom memories for cache arrays
  = Hierarchical floorplans, structured placement, mesh clocking




Hardened proAptiv Layout 


[Layout labels: Branch Prediction RAMs (1R1W), MBIST Engines, I-cache RAMs, TLB RAMs, D-cache RAMs. Included: Stdcells, Clock tree, Power grid]
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* Life revolves around flops (and muxes) 

  = No CAMs - schedulers, TLBs, BTBs all built from flops
  = No ROMs - div/sqrt lookup tables all built from gates
  = No multiports - register files, reorder buffers all built from flops
    - Read ports are large muxes ~O(num entries)
    - Write ports are small muxes ~O(num ports)
  = Exceptions are:
    - 1RW RAMs for use in cache/TLB arrays
    - 1R1W RAMs for use in branch prediction arrays
  = Used judiciously - proAptiv is the first MIPS soft core to use these

* Sophisticated techniques cannot be easily employed
    - Banking, sum-addressing or one-hot-indexing
    - Dynamic circuits, especially negedge-triggered

* More pipestages needed for a given frequency
  = MIPS's pure RISC ISA helps counteract this
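
The "no multiports" point above (register files built from flops, with mux-based ports) can be modelled roughly as follows. This is a behavioural sketch only: the class `FlopRegFile` and its width helpers are hypothetical names, and the "+1 hold input" on the write mux is an assumption about how an entry keeps its value.

```python
# Behavioural sketch of a flop-based register file with mux-built ports.
# Names and the hold-input accounting are illustrative assumptions.


class FlopRegFile:
    def __init__(self, num_entries: int, num_write_ports: int):
        self.flops = [0] * num_entries
        self.num_write_ports = num_write_ports

    def read(self, index: int) -> int:
        """One read port behaves like a num_entries:1 mux over all flops."""
        return self.flops[index]

    def clock(self, writes) -> None:
        """Apply up to num_write_ports (index, value) writes on a clock edge."""
        if len(writes) > self.num_write_ports:
            raise ValueError("more writes than write ports")
        next_state = list(self.flops)
        for index, value in writes:
            next_state[index] = value
        self.flops = next_state

    def read_mux_width(self) -> int:
        """Inputs per read-port mux: grows with the number of entries."""
        return len(self.flops)

    def write_mux_width(self) -> int:
        """Inputs per entry's write mux: one per write port plus hold."""
        return self.num_write_ports + 1


if __name__ == "__main__":
    rf = FlopRegFile(num_entries=32, num_write_ports=2)
    rf.clock([(3, 7), (5, 9)])
    print(rf.read(3), rf.read_mux_width(), rf.write_mux_width())
```

The width helpers make the slide's asymmetry visible: each read port scales with the entry count, each write mux only with the port count.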
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% Timing paths not consistent 
  = Variations in floorplan, configuration, stdcell, memory IP
  = Variations in operating point - fab, process, Vt mix, overdrive
  = Variations in EDA tool margins, flows, vendors and versions
* But good enough!
  = Balance logic across pipestages
  = Ensure loop paths are minimal and reflected in the microarchitecture
  = Ensure floorplan reflects critical unit and pin placement
* Specific considerations for high-frequency pipelines
  = Any CAM-RAM structures take at least 2 clock cycles
  = Regfile read-bypass takes at least 2 clock cycles
* Need to fix timing paths at all phases of the implementation
  = Synthesis, Place, Route, Clocking (No ECOs or manual tuning allowed)
* Verification challenges
  = Dozens of configuration variables but still need high code+functional coverage
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(9 Optional 


Cluster Power Controller 


Global Interrupt 
Controller 


Configuration 
Registers 


Interv Req 
Port Port 


L2 Cache 
Controller 


Main Memory Coherent IO 
Non-Coherent I/O Devices 


L2 Memory 
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Fast 
= Optimized for mobile computing and networking 
=  Multi-issue dynamically-scheduled operation 
= Deep pipeline to achieve multi-gigahertz operation 
= Brand new high-frequency FPU matched to core 


Efficient 
= Elegant balanced microarchitecture, not brute force width and depth 
= Minimal area for cost and leakage; fine-grain clock gating 
= Reduces the need for costly heterogeneous schemes 


Scalable 
= New 6-core Coherence Manager and 256-bit L2 cache controller


Robust 
=  Age-based scheduling, careful tuning of predictors/prefetchers 
= Easy to add features and performance, vary microarch parameters 


Feature set 
= MIPS32 R3 / MIPS16e, DSP ASE v2, PDtrace v6, Enhanced VA
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ЕШ Optional оа + Superscalar OoO CPU - 16 stage 


  = Quad inst fetch
  = Triple bonded dispatch
  = Inst peak issue: quad integer; dual FP

* Sophisticated branch prediction
  = L0/L1/L2 BTBs, RPS, JRC, way predicted instruction cache

* High performance, multi-level TLBs, way predicted data cache

* Instruction Bonding makes six issue pipes look like eight

* Fast integer divide, multiply and multiply-accumulate operations

* Dual Issue FPU
  = Higher speed (1:1 with CPU)
  = Lower latency on most operations
  = Single-pass double precision
  = More parallelism and dedicated schedulers - more ops in flight

[Block diagram labels: Bus IF Unit (to On-Chip Buses), Instruction L1 Cache (32-64 KB, 4 way), ISPRAM (4KB-1MB), Instruction Dispatch Unit, Instruction Issue Unit, Execution Pipes, CorExtend I/F, Power Management Unit (PMU), Data Scratchpad RAM (4KB-1MB), Data L1 Cache (32-64 KB, 4 way), DSPRAM I/F]
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16 stage integer load pipeline 


4 cycles 2 cycles 4 cycles 4 cycles 2 cycles 


4 cycles 4 cycles 


Variable cycles 
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% Combine adjacent instructions into single bonded ор 


* Load/Store bonding makes one memory pipe look like two
  = e.g. consecutive LW or SW instructions
  = e.g. branch with certain instructions in delay slot
* Fused compare-branch is already part of MIPS integer ISA

  = 1 DTLB, 1 tag array, single-ported data array saves area
  = Single DTLB and cache access saves energy, power
  = Occupies only 1 entry in various queues/buffers - more ILP
  = Carried forward as one operation on L1-miss - more MLP
  = Speeds memset, bcopy, strcmp, spill-fill, GPU communication

* Design decisions
  = Initially limit to two instructions, aligned addresses and ST
    - But designed to scale to four, misaligned accesses and MT
  = Therefore, needs a bonding predictor in the front-end
    - Trained by LSU (memtype must be cacheable or write-combining)
    - Indexed by PC and other control flow information
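
A rough sketch of the two-instruction bonding rule described above. It is not the proAptiv logic: the real design uses a front-end predictor trained by the LSU, whereas this sketch applies a static check. The `MemOp` type, `can_bond`, `bond`, and the offset-only alignment test are all assumptions made for illustration.

```python
# Static approximation of load/store bonding for adjacent word accesses.
# The hardware uses a trained predictor; this offset-based check is assumed.
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    kind: str    # "lw" or "sw"
    reg: int     # data register number
    base: int    # base address register number
    offset: int  # byte offset from the base register


def can_bond(first: MemOp, second: MemOp, width: int = 4) -> bool:
    """True if two adjacent ops could share one 2*width-byte memory access."""
    same_kind = first.kind == second.kind and first.kind in ("lw", "sw")
    same_base = first.base == second.base
    adjacent = second.offset == first.offset + width
    # Assumes the base register itself is aligned to 2*width bytes.
    aligned = first.offset % (2 * width) == 0
    # A load that overwrites its own base changes the second address.
    base_preserved = not (first.kind == "lw" and first.reg == first.base)
    return same_kind and same_base and adjacent and aligned and base_preserved


def bond(ops):
    """Greedily pair neighbouring ops; unpaired ops stay as 1-tuples."""
    bonded, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and can_bond(ops[i], ops[i + 1]):
            bonded.append((ops[i], ops[i + 1]))
            i += 2
        else:
            bonded.append((ops[i],))
            i += 1
    return bonded


if __name__ == "__main__":
    loads = [MemOp("lw", n + 1, 20, 4 * n) for n in range(8)]
    print(len(bond(loads)))  # eight loads collapse into four bonded ops
```

Under these assumptions the eight consecutive loads of an unrolled copy loop occupy four slots in the memory pipe instead of eight.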
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MemCopy Loop:
    lw    r1, 0x0(r20)
    lw    r2, 0x4(r20)
    lw    r3, 0x8(r20)
    lw    r4, 0xc(r20)
    lw    r5, 0x10(r20)
    lw    r6, 0x14(r20)
    lw    r7, 0x18(r20)
    lw    r8, 0x1c(r20)

    sw    r1, 0x0(r21)
    sw    r2, 0x4(r21)
    sw    r3, 0x8(r21)
    sw    r4, 0xc(r21)
    sw    r5, 0x10(r21)
    sw    r6, 0x14(r21)
    sw    r7, 0x18(r21)
    sw    r8, 0x1c(r21)

    addiu r20, r20, 0x20
    addiu r21, r21, 0x20
    bnez  r23, Loop
    sub   r23, r23, r22
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* Bonded stores have 3 source registers
  = 1 address and 2 data GPRs
  = Compared to 2 sources for ordinary stores
  = Requires 1 more read port at execute than unbonded machine
* Hence cracked into decoupled operations
  = STA (Store Address) - 1 reg source
  = STD (Store Data) - 2 reg sources
* STA reads cache tags and detects L1 miss early
  = Requires only 1 read port in load-store pipe
* STD delivers data to LSU in memory aligned format
  = Requires only 2 read ports
  = Thus avoiding the need for any pipe to have 3 ports
* Some stores are never cracked
  = e.g. Misaligned stores, where data depends on address
* Some stores are always cracked
  = e.g. FPU stores, where the integer scheduler has no visibility or control over the FP
    register file and issue ports
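
The read-port argument above can be checked with a small sketch. The micro-op dictionaries and the helper names `crack_bonded_store` and `max_read_ports` are hypothetical; they only mirror the source counts given on the slide (STA: 1 register source, STD: 2 register sources).

```python
# Sketch of cracking a bonded store into STA and STD micro-ops.
# The dict-based micro-op format is an illustrative assumption.


def crack_bonded_store(base_reg: int, offset: int, data_regs):
    """Split a bonded store (1 address GPR + 2 data GPRs) into two micro-ops."""
    sta = {"op": "STA", "srcs": [base_reg], "offset": offset}  # address only
    std = {"op": "STD", "srcs": list(data_regs)}               # data only
    return sta, std


def max_read_ports(uops) -> int:
    """Largest number of register sources any single micro-op needs."""
    return max(len(u["srcs"]) for u in uops)


if __name__ == "__main__":
    uncracked = {"op": "SW2", "srcs": [21, 1, 2]}  # hypothetical fused form
    sta, std = crack_bonded_store(21, 0x0, (1, 2))
    print(max_read_ports([uncracked]), max_read_ports([sta, std]))
```

The uncracked form would need three read ports in one pipe, while the cracked pair needs at most two in any pipe.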
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+ 


* Two issue queues
  = Neither single large unified queue (low-frequency)
  = Nor too many small distributed schedulers (high power)

* 1 ALU issue queue and 1 AGU issue queue
  = Check dependencies and structural hazards
  = STA and STD share same scheduler entry, reducing area/power

* Age-priority scheduling
  = Requires age-vector per entry to pick oldest
  = Allows non-shifting schedulers with fewer comparators/muxes for low power
  = Minimal CAM logic - timing friendly

* No reservation stations
  = Read registers after scheduling - low power
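
One common way to realise an age vector per entry in a non-shifting scheduler is sketched below. This is an assumption about the general technique, not a description of the proAptiv circuit: the class `AgeScheduler`, its bitmask encoding, and the lowest-free-slot allocation policy are all hypothetical.

```python
# Sketch of age-priority picking with one age bit-vector per entry.
# Encoding and allocation policy are illustrative assumptions.


class AgeScheduler:
    def __init__(self, size: int):
        self.size = size
        self.valid = 0              # bitmask of occupied entries
        self.older = [0] * size     # older[i]: entries older than entry i

    def allocate(self) -> int:
        """Place a new op in any free slot; entries never shift."""
        for i in range(self.size):
            if not (self.valid >> i) & 1:
                self.older[i] = self.valid  # every current entry is older
                self.valid |= 1 << i
                return i
        raise RuntimeError("scheduler full")

    def pick(self, ready_mask: int):
        """Return the ready entry that has no older ready entry."""
        ready = ready_mask & self.valid
        for i in range(self.size):
            if (ready >> i) & 1 and not (self.older[i] & ready):
                return i
        return None

    def issue(self, i: int) -> None:
        """Free entry i and clear it from every other entry's age vector."""
        self.valid &= ~(1 << i)
        for j in range(self.size):
            self.older[j] &= ~(1 << i)


if __name__ == "__main__":
    s = AgeScheduler(4)
    a, b, c = s.allocate(), s.allocate(), s.allocate()
    s.issue(a)
    d = s.allocate()          # reuses slot 0 but is the youngest op
    print(s.pick(0b0101))     # c (slot 2) wins over d (slot 0)
```

Because age lives in the vectors rather than in slot position, a reused low-numbered slot does not gain priority over older operations.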
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* Holy grail of OoO scheduler design: 
= Large (40 — 64 entries) yet fast (able to follow single-cycle dependency chains) 


% Typical schedulers employ one of two wakeup techniques 
" Encoded register-number wakeup (e.g. MIPS R10K) 
(Wakeup > Pick > Mux) > (Wakeup > Pick > Mux) >... 
Pick and Mux can sometimes be overlapped 
= Decoded entry-number wakeup (e.g. MIPS 1074K) 
(Wakeup > Pick) > (Wakeup > Pick) >... 
Usually multi-hot vectors for dependency checking 


% proAptiv can utilize a third technique 
= Transitive Wakeup 
(Wakeup) > (Wakeup) > (Wakeup) > ... 


= Only works with decoded entry-numbers 
Relies on multi-hot broadcasts 
|21,2 2521,2 545 7) > (1,2, 3, 4, 5, 5, 7] 
= Requires strict age-priority scheduling and other constraints 
Prevents premature pick of a younger op dependent on an older op 
e.g. inst 6 before inst 4 




* One simple ALU pipe
  = Handles arithmetic, logical ops and small shifts
* One complex ALU pipe
  = Handles a superset of the simple ALU ops, such as large shifts
  = Handles DSP operations that involve reading or writing the 64b accumulators
    - Accumulators are renamed and treated as two 32b registers
      Saves power and area compared to designs using 64b rename pool
    - DSP flags are renamed using separate 13b wide pool
      Allows easy handling of sticky status bit fields
  = Interfaces with Multiply-Divide Unit which also uses the accumulators
    - Supports single-cycle bypass for integer multiply-accumulate
    - New designs for fast multiplication and very fast division
* One branch/store-data pipe
* One load/store pipe
* Pipes share read and write ports to further bring down area/power

* Thanks to bonding, the 4 physical pipes can actually execute
  up to 6 MIPS32 integer instructions on a particular clock cycle
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* Designed for large modern workloads
  = Enhanced Virtual Addressing (EVA) allows efficient access > 3GB
    - Via programmable segments and new kernel load-store instructions


* LSU
  = Out-of-Order operation: loads/stores can (with some restrictions) overtake each other
    - Important for performance
    - And efficiency (maximizes utilization of single load-store pipe)
    - But requires:
  = Excellent memory disambiguation and "RAW" hazard avoidance
    - Overeager Load Predictor accessed before insertion into scheduler
    - LSU CAMs detect failure to forward from store buffer and train predictor
      Mark a specific load as overeager
    - Predictor forces marked loads to be uneager
      Scheduler holds overeager loads until all older STA and STD have issued
  = Enforce MIPS' weakly-ordered memory consistency model
    - Store merging, lightweight and heavyweight SYNCs, cache-ops
    - FP stores can graduate even before receiving store data from FPU
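
The overeager-load mechanism above can be sketched as a small PC-indexed table. This is a minimal illustration under assumptions: the table size, the direct-mapped PC index, and the names `OvereagerLoadPredictor` and `may_issue_load` are hypothetical, and the slide does not describe how or whether marks are ever cleared.

```python
# Sketch of an overeager-load predictor gating out-of-order load issue.
# Table organisation and indexing are illustrative assumptions.


class OvereagerLoadPredictor:
    def __init__(self, entries: int = 256):
        self.entries = entries
        self.marked = [False] * entries

    def _index(self, pc: int) -> int:
        # Assumed direct-mapped index on word-aligned instruction addresses.
        return (pc >> 2) % self.entries

    def is_overeager(self, pc: int) -> bool:
        return self.marked[self._index(pc)]

    def train(self, pc: int) -> None:
        """Called when the LSU detects a load that missed a store forward."""
        self.marked[self._index(pc)] = True


def may_issue_load(pc: int, older_stores_pending: int,
                   predictor: OvereagerLoadPredictor) -> bool:
    """Marked loads wait for all older STA/STD; others may issue early."""
    if predictor.is_overeager(pc):
        return older_stores_pending == 0
    return True


if __name__ == "__main__":
    p = OvereagerLoadPredictor()
    print(may_issue_load(0x400100, 2, p))  # unmarked load issues eagerly
    p.train(0x400100)
    print(may_issue_load(0x400100, 2, p))  # now held behind older stores
```

Only loads that have actually violated ordering pay the waiting cost, which keeps the single load-store pipe busy in the common case.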


* BIU
  = Write-combining and bonding to support streaming writes
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Inst VA 
* MIPS dual-entry scheme in TLBs
  = Two VAs differing by 1 address bit share CAM/index portion of entry
  = Separate PA for each of the two VAs

* Instruction and Data TLBs
  = Holds 16KB or 4KB pages or sub-pages from VTLB/FTLB
  = 16 entry Instruction TLB
  = 32 dual entry Data TLB
  = Fast adder-comparator logic

* Variable page size TLB (VTLB)
  = 64 dual entries, fully associative
  = Holds pages from 4KB - 256MB

* Fixed page size TLB (FTLB)
  = 512 dual entries, 4-way assoc
  = Holds either 16KB or 4KB pages
  = Optional at build and runtime
  = SRAM-based implementation

[Diagram labels: Inst VA, Inst TLB, Inst PA, VTLB/FTLB, TLB miss handling, Data TLB, Data PA]
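
The dual-entry scheme above (two virtual pages that differ in one address bit sharing a tag, each with its own physical page) can be sketched as follows. It is a simplified illustration: fixed 4KB pages, a dictionary in place of a CAM, and the names `DualEntryTLB`, `insert`, and `translate` are assumptions; ASIDs, page masks, and valid/dirty bits are omitted.

```python
# Simplified model of a MIPS-style dual-entry TLB with fixed 4KB pages.
# Dictionary lookup stands in for the CAM; attribute bits are omitted.

PAGE_SHIFT = 12  # 4KB pages (assumed for this sketch)
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class DualEntryTLB:
    def __init__(self):
        # vpn2 (virtual page number without its low bit) -> [even PFN, odd PFN]
        self.entries = {}

    def insert(self, vpn2: int, pfn_even, pfn_odd) -> None:
        """One tag covers an even/odd page pair; each half has its own PFN."""
        self.entries[vpn2] = [pfn_even, pfn_odd]

    def translate(self, va: int):
        """Return the physical address, or None on a TLB miss."""
        vpn = va >> PAGE_SHIFT
        pair = self.entries.get(vpn >> 1)
        if pair is None:
            return None
        pfn = pair[vpn & 1]  # low VPN bit selects the even or odd half
        if pfn is None:
            return None
        return (pfn << PAGE_SHIFT) | (va & PAGE_MASK)


if __name__ == "__main__":
    tlb = DualEntryTLB()
    tlb.insert(0x10, 0x80, 0x95)
    print(hex(tlb.translate(0x20ABC)), hex(tlb.translate(0x21ABC)))
```

Sharing the tag halves the compare logic per mapped page, which matters when the TLB is built from flops.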
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% Brand new high-speed design 

  = Can run 1:1 with proAptiv up to top achievable core frequency
  = Native double-precision datapath
  = FMAC-based pipeline with early and late bypass for FADD/FMUL
    - 4-cycle FADD, 4-cycle FMUL, 7-cycle FMAC
  = Low latency and high throughput for long ops like div/sqrt/recip/rsqrt
    - Functional iterative algorithms and lookup tables compared to bitwise SRT
    - Can run independent instructions under a long op, including other long ops

* Coprocessor style FPU
  = Has its own decoupled pipeline, regfile and load/store interface buffers
  = Non-stalling design using shelving buffers to reduce power, improve perf
    - Lower power than PRF-style renaming, given flop-based implementation

* Formal verification
  = Against a precise IEEE-compliant mathematical model
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% Accompanies both proAptiv and interAptiv cores 
  = 256KB to 8MB shared across 1 to 6 cores
* 8-way associative
* Selectable 32 or 64B line size
* 256-bit internal datapaths and buffers
* Up to 256-bit interface to system interconnect
* Optional wait states on tag, data or control RAMs
  = Accommodates slow memories, due to:
    - Large size or high-frequency operation
    - HD bitcells, pipelined RAMs, low-voltage operation
* Optional ECC on all RAMs
  = Adds one pipestage
* L2 storage non-inclusive to L1
* Critical-word first; can interleave responses to multiple cores
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proAptiv Dual-Core Floorplan 


Configuration:

* Per base core
  = FPU
  = 32KB/32KB I/D L1$
  = TLB
    - I and D TLBs
    - 128 entry VTLB
    - 1024 entry FTLB
  = PDtrace

* Cluster level
  = Dual core coherence
  = 1MB L2$ with ECC
  = PDtrace aggregator
  = 64-Interrupt Controller
  = HW IO coherence
  = Cluster power controller
  = Probe interface block
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proAptiv Quad-Core Floorplan 


1MB RAM 


CM2 


1MB RAM 


System I/O
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* Fast
  = 4.5 EEMBC CoreMark/MHz
    - Highest single-threaded score published for any licensable CPU*
    - 75% over prior MIPS 1074K core
  = Operating frequency > 1GHz worst-case, >> 2GHz typical at 40nm
* Slim
  = Highest CoreMark/mm² for any licensable CPU*
  = Dual core area ~ 1MB L2 cache
* Cool
  = Highest CoreMark/mW for any licensable CPU*
  = Sub half-watt power at 40nm

* Efficient performance on a fully-synthesizable core

* CoreMark/MHz derived from publicly available and published scores at http://www.coremark.org
  Area and power efficiencies based on MIPS internal and competitive estimates
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Swizzle Switch: А Self-Arbitrating High-Radix 
Crossbar for NoC Systems 


Ronald Dreslinski, Korey Sewell, Thomas Manville, Sudhir Satpathy, 
Nathaniel Pinckney, Geoff Blake, Michael Cieslak, Reetuparna Das, 
Thomas Wenisch, Dennis Sylvester, David Blaauw, and Trevor Mudge 
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= Swizzle Switch—Circuit & Microarchitecture 
= Arbitration 
= Prototype 
= Swizzle Switch—Cache Coherent Manycore Interconnect 
= Motivation & Existing Interconnects 
= Swizzle Switch Interconnect 
= Evaluation 
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Swizzle Switch 


[Figure: crossbar diagram with arbitration tree, control logic and numbered inputs]
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Conventional Matrix Crossbar Swizzle Switch 


Embeds arbitration within crossbar—single cycle arbitration 
= Re-use input/output data buses for arbitration 

SRAM-like layout with priority bits at cross-points 

= Low-power optimizations 

= Excellent scalability 
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Data Routing

= Pre-charge
= Multicast & Broadcast
= Bitlines discharged if
  - Data = “1”
  - Crosspoint = “1”
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Swizzle Switch Architecture

[Figure: crosspoint array with a sense amp + latch at each crosspoint]
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= Swizzle Switch Interconnect 
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Inhibit Based Arbitration 


[Figure: one output column with Req l to Req n, Priority line l to Priority line n, and a sense amp + latch at each crosspoint]

This diagram is a single column in the Swizzle-Switch (output); each output arbitrates/transfers data independently.

Each crosspoint has a sense amp/latch to indicate connectivity. Each input samples a unique bit of the output bus to determine if it has been granted the channel.

Priority vectors are stored and when a request is issued they discharge bits along the output columns to INHIBIT lower priority requests.

Finally, the priority vectors are updated when the data transfer completes.
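
The inhibit mechanism for one output column can be modelled behaviourally. This is a sketch under assumptions, not the circuit: the function name `arbitrate` and the boolean matrix `priority[i][j]` (true when input i outranks input j and so discharges j's priority line) are illustrative choices.

```python
# Behavioural model of inhibit-based arbitration on one output column.
# priority[i][j] == True means a request from input i discharges line j.


def arbitrate(requests, priority):
    """Return per-input grants after all requesters pull down their lines."""
    n = len(requests)
    line_charged = [True] * n  # priority lines start pre-charged
    for i in range(n):
        if requests[i]:
            for j in range(n):
                if priority[i][j]:
                    line_charged[j] = False  # i inhibits lower-priority j
    # Each requester senses its own line: still charged means granted.
    return [requests[i] and line_charged[i] for i in range(n)]


if __name__ == "__main__":
    # Assumed priority order for three inputs: 2 > 1 > 0.
    priority = [
        [False, False, False],  # input 0 (lowest) discharges no lines
        [True, False, False],   # input 1 discharges line 0
        [True, True, False],    # input 2 (highest) discharges lines 0 and 1
    ]
    print(arbitrate([True, True, False], priority))
```

With a strict priority order, exactly one requester finds its line still charged, so arbitration resolves in a single step.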
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— ---. 


[Figure: one output column with priority lines l to n and a sense amp + latch at each crosspoint]
Req l: LOWEST Priority - Discharges NO Priority Lines
Req m: INTERMEDIATE Priority - Discharges SOME Priority Lines
Req n: HIGHEST Priority - Discharges ALL Priority Lines
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Least Recently Granted(LRG) 


[Figure: one output column with Req l, Req m, Req n and priority lines l, m, n]

Example Arbitration:
(1) Req l and Req m request the bus (red lines)
(2) Req m discharges Priority line l; priority lines m and n remain charged (green lines)
(3) Req l senses Priority line l and is inhibited (not granted); Req m senses Priority line m and is not inhibited
(4) The crosspoint records the connectivity at input m


Example Priority Update:

Input m signals it is done with data transfer by asserting Rel m

[Figure labels: RESET, Req n]


Req l: INTERMEDIATE Priority - Discharges SOME Priority Lines
Req m: LOWEST Priority - Discharges NO Priority Lines
Req n: HIGHEST Priority - Discharges ALL Priority Lines
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---. 


“m” releases channel:
= Input m: least priority
= Input l: priority upgraded
= Input n: priority unchanged
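
The Least Recently Granted update shown above can be written as a small matrix operation. This is a behavioural sketch under assumptions: `priority[i][j]` being true means input i outranks input j, and the function name `lrg_release` is hypothetical. It follows the slide's outcome, where the releasing input drops to lowest priority and the other inputs keep their relative order.

```python
# Behavioural sketch of the LRG priority update when input m releases.
# priority[i][j] == True means input i outranks (inhibits) input j.


def lrg_release(priority, m: int):
    """Demote input m to lowest priority; leave other relations untouched."""
    n = len(priority)
    for j in range(n):
        priority[m][j] = False      # m no longer inhibits anyone
        if j != m:
            priority[j][m] = True   # every other input now inhibits m
    return priority


def rank(priority):
    """Inputs ordered from highest to lowest priority (most lines discharged)."""
    return sorted(range(len(priority)),
                  key=lambda i: sum(priority[i]), reverse=True)


if __name__ == "__main__":
    # Assumed starting order for inputs (l, m, n) = (0, 1, 2): n > m > l.
    priority = [
        [False, False, False],
        [True, False, False],
        [True, True, False],
    ]
    lrg_release(priority, 1)        # input m releases the channel
    print(rank(priority))           # n stays highest, l moves up, m is last
```

In this example input l moves from lowest to intermediate, input m becomes lowest, and input n is unchanged, matching the three labels on the slide.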
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Increasing Priority 
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64x64 Prototype

[Die photo with dimension annotations]

Die area: 15.6mm2
Fabric area, Transistor count, # Data wires: 4.06mm2, 6.95M, 8192
Throughput, Frequency: 4.47Tb/s @ 1.1V, 559MHz, 25°C
Energy Efficiency at peak throughput: 3.4Tb/s/W
Peak energy efficiency: 7.4Tb/s/W @ 0.6V
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Measurement Results 


[Plot: measured results]
Fabric size: 64x64 (128 bit channel)
Technology: 45nm
4.47Tb/s @ 1.1V
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ы. le 
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Measurement Results 
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Scaling Interconnect for Many-Cores 


= Existing interconnects—Buses, Crossbars, Rings 
= Limited to ~16 cores 


= Others' interconnect proposals for Many-Cores
= Packet-switched, multi-hop, network-on-chip (NoC) 
= Grid of routers—meshes, tori and flattened butterfly 


= Our Proposal 
= Swizzle Switch Networks 
= Flat single-stage, one-hop, crossbar++ interconnect 
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Mesh Network-on-Chip 
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Motivating Swizzle Switch Networks 


= Uniform access latency 
= Ease of programming, data placement, thread placement...


= Low Power 


= Simplicity 
= Packet-switched NoCs need routing, congestion management, flow
control, wormhole switching...
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Motivating Swizzle Switch Networks 


Mesh: Average = 0.017 flits/node/sec
SSN: Average = 0.013 flits/node/sec
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= Unfairness = Nodehighest throughput / Nodejowest throughput 
= Hotspot Traffic = All nodes sending data to nodes в 


= Under Hotspot traffic, the Crossbar has slightly less
throughput than the Mesh but is 40x more fair.
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Motivating Swizzle Switch Networks 


Mesh: Average = 0.36 flits/node/sec
SSN: Average = 0.47 flits/node/sec
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= In the Mesh, nodes closest to the center receive the highest 
throughput 


= Under Uniform Random traffic, the Crossbar has more 
throughput than the Mesh and is 87% more fair. 
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Motivating Swizzle Switch Networks © 
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= Arbitration 
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= Swizzle Switch—Cache Coherent Manycore Interconnect 
= Motivation & Existing Interconnects 


= Swizzle Switch Interconnect


= Evaluation 
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Top-Level Floorplan 
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Evaluation 


= Simulation Parameters 


Cores: 64 in-order cores, 1 IPC, 1.5 GHz
L1 Cache: 32kB I/D Caches, 4-way associative, 64-byte line size, 1 cycle latency
L2 Cache (Mesh): Shared L2, 16 MB, 64-way banked, 8-way associative, 64-byte line size, 10 cycle latency
L2 Cache (SSN): Shared L2, 16MB, 32-way banked, 16-way associative, 64-byte line size, 11 cycle latency
Interconnect (Mesh): 3.0 GHz, 128-bit, 4-stage Routers, 3 virt. networks w/ 3 virt. channels
Interconnect (SSN): 1.5 GHz, 64x32x128bit
Main Memory: 4096MB, 50 cycle latency


= Benchmarks 
= SPLASH 2: Scientific parallel application suite 
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Results—Performance & QoS 
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Results—Power 
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= Swizzle Switch Prototype (45nm) 
= 64x64 Crossbar with 128-bit busses 
= Embedded LRG priority arbitration 
= Achieved 4.4 Tbps @ ~600MHz consuming only 1.3W of power


= Swizzle Switch Network Evaluation 
= Improved performance by 21% 
= Reduced power by 28% 
= Reduced latency variability by 3x 
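The prototype figures can be cross-checked with simple arithmetic. The sketch below assumes the 4.4 figure is in Tb/s and that peak throughput means every one of the 64 output ports delivers one 128-bit word per cycle; both are assumptions, and the function names are hypothetical.

```python
def required_freq_hz(throughput_bps, ports, width_bits):
    """Clock rate needed if every port moves one word per cycle."""
    return throughput_bps / (ports * width_bits)


def efficiency_tbps_per_watt(throughput_bps, power_w):
    """Energy efficiency expressed in Tb/s per watt."""
    return throughput_bps / power_w / 1e12


freq_mhz = required_freq_hz(4.4e12, ports=64, width_bits=128) / 1e6
eff = efficiency_tbps_per_watt(4.4e12, power_w=1.3)
```

Under those assumptions the clock works out to roughly 537 MHz, in the same range as the quoted ~600MHz, and the efficiency rounds to the 3.4Tb/s/W quoted for peak throughput.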
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Arbitration Mechanism (Matrix View) 
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Round Robin Arbitration 
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Timing Diagram 
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Crosspoint Circuit 
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Regenerative Bit-line Repeater 
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Simulated bit-line delay improvement € 
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SSN Scaling: Simulation 


Technology : 45nm 
Supply : 1.1V 
Temperature : 25° C 
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Runtime Speedup Over 4-Cycle Mesh NoC 
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Hardware acceleration is Niche 


* (With the obvious exception of graphics for gaming!)
* Even for people with a direct financial incentive to go faster...
* The reason?
It is enormously hard work!
* I'm going to talk about:
— Why it is so hard
— How to make it easier


— Using online trading applications as an example 
Industry perspective 


SOLARFLARE* 
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Trading applications 


* Traders are in a latency race
Signal → Decision → Action


* Whoever responds to signals fastest, makes money 


* These folks have money to spend on technology, and exceptional 
engineers in-house 


* But it is not a case of performance-at-any-cost. Like everyone else,
they must balance performance against: 
— Flexibility / speed of deployment 
— Available skills 
— Cost 
— Compatibility 
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Trading applications 


[Diagram: Broker, Trader, Ultra-low latency trader and Trading venue; links carry Market data (UDP multicast) and Order execution (TCP)]


...Of course there is lots more that we don't have time to go into... 


But all participants have a financial incentive to reduce latency, and
many also have a throughput challenge.
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How does the NIC help?


* Low latency cut-through design

* 1024 VNICs per port

* VNIC == Virtual NIC
— Independent interface for sending
and receiving packets

[Diagram labels: device driver, kernel]


* Flow steering 
— Direct individual flows to specific VNICs 
— Supports scaling and NUMA locality 
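Flow steering as described here amounts to a lookup from a flow's identity to a VNIC. The sketch below is a toy software model, not Solarflare's interface: the `Flow` fields, the class name and the fall-back to a default VNIC are all assumptions made for illustration.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """Hypothetical flow key; the real NIC's match fields may differ."""
    protocol: str
    dst_ip: str
    dst_port: int


class FlowSteering:
    def __init__(self, num_vnics, default_vnic=0):
        self.num_vnics = num_vnics
        self.default_vnic = default_vnic  # assumed catch-all, e.g. the kernel driver
        self.filters = {}

    def add_filter(self, flow, vnic):
        """Direct one individual flow to a specific VNIC."""
        if not 0 <= vnic < self.num_vnics:
            raise ValueError("no such VNIC")
        self.filters[flow] = vnic

    def steer(self, flow):
        """Return the VNIC that should receive packets of this flow."""
        return self.filters.get(flow, self.default_vnic)


steering = FlowSteering(num_vnics=1024)
feed = Flow("udp", "239.1.1.1", 30001)
steering.add_filter(feed, vnic=7)
```

Pinning each flow to one VNIC is what lets the application keep a flow's packets on a chosen core, which is the scaling and NUMA-locality point made above.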
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Kernel networking 


* Traditionally the network stack 
executes in the OS kernel 


* Received packets are processed in 
response to interrupts 


* Applications invoke the network via
the BSD sockets interface by making
system calls
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Kernel bypass - OpenOnload 


* Dedicate a VNIC per application or 
thread 


* TCP/UDP stack as user-level library
* Critical path entirely at user-level 


* Reduces per-message CPU time 
— Cuts latency in half 


— Increases message rate by 5x per 
core 


— Improves scaling 


* Fully compatible — no changes to 
applications needed 
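For reference, this is the kind of unmodified BSD sockets code the compatibility claim refers to. It is a minimal loopback sketch using Python's standard `socket` module; the function name is hypothetical. On the kernel path each `bind`, `sendto` and `recvfrom` is a system call, while the slide's claim is that a user-level stack can service the same calls without the application changing.

```python
import socket


def udp_roundtrip(payload: bytes, timeout: float = 2.0) -> bytes:
    """Send one UDP datagram to ourselves over loopback and read it back."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        rx.settimeout(timeout)          # avoid blocking forever if the datagram is lost
        rx.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
        tx.sendto(payload, rx.getsockname())
        data, _sender = rx.recvfrom(2048)
        return data
    finally:
        rx.close()
        tx.close()
```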
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What is so hard about FPGA acceleration? 


* Let's assume you want some custom logic 
— Evaluate the available board options 
— (Expensive → low volume → expensive → low volume...)

* FPGA image:
— Development tools
— You'll need some IP blocks (these need licencing, configuring etc.):
* PCIe engine
* Media access controller (MAC)
* Memory controller
— Boilerplate
* Packet handling: Parsing, demultiplexing, buffering, streaming
* Protocol handling: Checksums, headers, address resolution, TCP, UDP
* Managing physical links: Configuration, errors, statistics, flow control

* Host software:
— Device drivers
— Control path
— Fast interface to application (kernel bypass)
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So what is wrong with existing offerings? 


* Far too much work needed to create a deployable application
— Apart from cost and time, the FPGA dev skills just aren't available 
> Need to provide the boilerplate (at the very least) 
> Need a much simpler host interface 


* Hard to deploy incrementally 
— Requires simultaneous changes to multiple components 
— Accelerator network interface can only be used for accelerated traffic 
— Consumes an extra PCIe slot
> Needs to integrate with existing apps 


* Expensive 
— Requires huge benefit to justify investment 
> Need a solution that is widely useful (so we can make lots of them) 
> Need off-the-shelf applications to sell in volume 
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Solarflare’s Application Onload Engine 


ApplicationOnload" Engine 


* Not an FPGA with an Ethernet interface:
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Solarflare's Application Onload Engine 


* Out of the box it works like a regular Solarflare network adapter
— Drivers included
— Works with kernel network stack and kernel bypass (OpenOnload)

* Incremental upgrade
— Pass-thru by default
— Accelerate a subset of traffic
— No new switches, cabling, slot

* Solarflare & 3rd party applications
— Solve common problems
— No FPGA expertise required

* FDK (developer kit)
— Reusable IP blocks to minimise effort for FPGA developers
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AOE: Block diagram

[Block diagram; legible labels: Mictor debug connector, direct backdoor, Altera config, x8 PCIe Gen 1/2, SMBus]
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Host software interface 


* The data path is just packets
— Applications on the host use BSD sockets
* Via the kernel stack
* Or via kernel bypass for higher performance

* We also provide a register bus
— Mastered via a software API or command line tool
— FPGA applications expose registers and memory
— Notifications
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What are the negatives? 


* Compared to host-attached-FPGA 
boards: 


— AOE has higher latency between FPGA and host 


— AOE has no direct access to host from FPGA 


* Harder for FPGA apps to access host memory 
— Software on critical path 


— Much higher latency
• FPGA can’t master other devices
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Alternative architectures? 


* Bearing in mind we didn't want to change the ASIC 


PCIe attached FPGA
[Diagram labels: NIC ASIC, PCIe switch]

* Pros:
— Fast access to host from FPGA
— Better latency for pass-thru

* Cons:
— Increased complexity
* FPGA interacts with NIC ASIC via descriptor rings
* Need new interface between FPGA and host
— PCIe core in FPGA
* (Less space for other things)
— Significantly worse latency between host and wire 
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Alternative architectures? 


* Bearing in mind we didn't want to change the ASIC 


* Add PCIe interface to FPGA

* Pros:
— Fast access to host from FPGA

* Cons:
— Increased complexity
* Need new interface between FPGA and host
— PCIe core in FPGA
* (Less space for other things)
— Slightly worse latency through NIC

Copyright © 2012 Solarflare Communications, Inc.

Alternative architectures?
* And if we could change the ASIC?
* Add fast bus for host access

* Pros:
— Fast access to host from FPGA

* Cons:
— Increased complexity
— New interface between FPGA and host
* But at least we're backwards compatible, so optional
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Example 1: Dual-line arbitration 




* Market data is published as a pair 
of redundant feeds 


* Traders often subscribe to both 
— For reliability 
— To get lowest latency 


* Line arbitration converts the pair of 
streams into a single feed 
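Line arbitration can be sketched as "forward the first copy of each message, drop the second". The slides do not say how copies are matched, so the sketch below assumes every message carries a sequence number that is identical on both lines; that assumption, the tuple layout and the function name are all hypothetical.

```python
def arbitrate(arrivals):
    """Merge two redundant feeds into one.

    arrivals: iterable of (line, seq, payload) in arrival order.
    Yields (seq, payload, line) for the first copy of each sequence number.
    A real arbiter would bound this state; the unbounded set is a simplification.
    """
    seen = set()
    for line, seq, payload in arrivals:
        if seq in seen:
            continue  # the other line already delivered this message
        seen.add(seq)
        yield seq, payload, line


# Line A is faster for message 1, line B for message 2, and A loses message 3.
arrivals = [("A", 1, "x"), ("B", 1, "x"), ("B", 2, "y"), ("A", 2, "y"), ("B", 3, "z")]
merged = list(arbitrate(arrivals))
```

The merged stream carries each message once, taken from whichever line delivered it first, which gives both the reliability and the latency benefit listed above.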


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Example 1: Dual-line arbitration 


* Line arbitration in the FPGA accelerator
— Application sees a single stream

* Application gets the benefits of dual-line arbitration with half the data rate

* More likely to keep up
— Reduces queuing delays
— Reduces likelihood of unrecoverable loss due to buffer overflow

* No changes to software!
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Example 2: Symbol splitting

* Market data packets contain messages for multiple securities (symbols)

* Single or few packet streams

* Distributing load is a problem
— May only be interested in a subset of symbols
— Or may want to distribute load over multiple processes or threads
— Must process messages in order (per symbol)

* Demultiplexing in software is inefficient
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Example 2: Symbol splitting 


* Split market data stream into per-symbol streams

* NIC distributes streams across processes, threads, cores

* Much higher throughput possible
— More efficient because we've eliminated thread/cache interactions

* Lower latency

* Discard symbols we don't care about
— Reduce throughput and queuing delays
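The splitting step can be modelled as a table lookup from symbol to output stream. The sketch below is a software illustration only, with a hypothetical message layout of (symbol, body) pairs and a hypothetical mapping table; symbols missing from the table are treated as "not interesting" and discarded.

```python
def split_by_symbol(messages, symbol_to_stream, num_streams):
    """Distribute messages over output streams, preserving per-symbol order.

    messages: iterable of (symbol, body) in arrival order.
    symbol_to_stream: dict mapping each wanted symbol to a stream index.
    """
    streams = [[] for _ in range(num_streams)]
    for symbol, body in messages:
        stream_id = symbol_to_stream.get(symbol)
        if stream_id is None:
            continue  # discard symbols we don't care about
        streams[stream_id].append((symbol, body))
    return streams


feed = [("AAA", 1), ("BBB", 1), ("ZZZ", 1), ("AAA", 2), ("BBB", 2)]
table = {"AAA": 0, "BBB": 1}  # hypothetical mapping; "ZZZ" is unwanted
out = split_by_symbol(feed, table, num_streams=2)
```

Because a symbol always maps to the same stream, each consumer sees that symbol's messages in their original order without coordinating with other consumers.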
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Developer Kit Framework 




* Next we show how the symbol splitter is implemented in the FPGA 
* Using reusable components 


* Connected by a streaming packet bus 
— Based on Altera's Avalon-ST streaming interface 
— Carries packets and/or messages 
— Meta-data words are interleaved within packets 


* Components connected by packet bus may: 
— Inspect packets and add meta-data 
— Mutate packet data and meta-data 
— Pass-thru meta-data they don't recognise 
— Take actions based on meta-data 
* Manipulate state (lookup-tables, databases)
* Routing decisions
* Buffering (FIFOs, off-chip memory)
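The component model described above can be mimicked in software as a chain of functions that each take a packet, read or add meta-data, and hand it on. This is a toy model with hypothetical field and function names, not the developer kit's interface; in hardware the meta-data travels as words interleaved on the streaming bus rather than as a dictionary.

```python
def ingress(data, port, timestamp):
    """Create a packet with the meta-data added at the start of the pipeline."""
    return {"data": data, "meta": {"ingress_port": port, "timestamp": timestamp}}


def tag_length(pkt):
    """Inspect the packet and add meta-data."""
    pkt["meta"]["length"] = len(pkt["data"])
    return pkt


def route_by_port(pkt):
    """Take an action (a routing decision) based on existing meta-data."""
    port = pkt["meta"].get("ingress_port")
    pkt["meta"]["route"] = "host" if port == 0 else "pass-thru"
    return pkt


def run_pipeline(pkt, components):
    for component in components:
        pkt = component(pkt)
    return pkt


pkt = ingress(b"abc", port=0, timestamp=123)
pkt["meta"]["vendor_x"] = 7  # meta-data no component below recognises
result = run_pipeline(pkt, [tag_length, route_by_port])
```

Note that `vendor_x` survives untouched, mirroring the rule that components pass through meta-data they do not recognise.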
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Symbol splitter implementation 


[Pipeline diagram; legible labels: MAC, IP, MSG]
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Single meta-data word at start of each packet

Meta-data: ingress port, timestamp
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Parse headers, add meta-data

Protocol
Addresses: VLAN, IP, port
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Lookup stream, add meta-data

Stream-ID
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Parse eligible packets into messages 


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Pass-thru other packets 




* Errors during parsing also lead to pass-thru 
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Split packets at message boundaries 




* Original headers are discarded at this point 
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Lookup trading symbol mapping 




* Assign integer ID for each output stream 
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Symbol stream-ID selects FIFO 


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Arbitrate amongst symbol streams 


Arbiter

Selects input based on packet
meta-data and/or fill-level.

Can be custom logic.

Trade-off between packet rate
and latency
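One possible fill-level policy is "serve the fullest FIFO first". The slide leaves the policy open (it can be custom logic), so the sketch below is just one hypothetical choice, with ties going to the lowest-numbered FIFO.

```python
from collections import deque


def select_fifo(fifos):
    """Return the index of the fullest non-empty FIFO, or None if all are empty."""
    best = None
    for index, fifo in enumerate(fifos):
        if fifo and (best is None or len(fifo) > len(fifos[best])):
            best = index
    return best


fifos = [deque(["a"]), deque(["b", "c", "d"]), deque()]
chosen = select_fifo(fifos)
```

Draining the fullest queue first favours packet rate and overflow avoidance over the latency of lightly loaded streams, which is one face of the trade-off mentioned above.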
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Stitch messages back into packets 


Stitcher


Combines messages into 
a packet. 


Prepends headers 


* For minimum latency at low rates: One message per packet 
* Packet rate limit per stream forces multiple messages per packet 
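The two bullets above describe a batching rule: send immediately when allowed, otherwise accumulate messages until the rate limit permits the next packet. A minimal sketch for a single stream follows; the time units, the `min_gap` parameter and the absence of a maximum packet size are simplifying assumptions.

```python
def stitch(messages, min_gap):
    """Group messages into packets sent at least min_gap apart.

    messages: list of (arrival_time, payload), sorted by arrival_time.
    Returns a list of (send_time, [payloads]).
    """
    packets = []
    next_allowed = float("-inf")
    i = 0
    while i < len(messages):
        # Send as soon as the first waiting message and the rate limit both allow.
        send_time = max(messages[i][0], next_allowed)
        batch = []
        while i < len(messages) and messages[i][0] <= send_time:
            batch.append(messages[i][1])
            i += 1
        packets.append((send_time, batch))
        next_allowed = send_time + min_gap
    return packets


slow = stitch([(0, "a"), (10, "b"), (20, "c")], min_gap=5)
burst = stitch([(0, "a"), (1, "b"), (2, "c"), (3, "d"), (12, "e")], min_gap=5)
```

At low rates every message travels alone; during the burst, messages b, c and d wait for the limit and share one packet.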
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...and deliver to host 


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Custom apps: How much work? 


* FPGA image
(some standard blocks)

* App/FPGA interface
* App integration
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Custom apps: How much work? 


* FPGA business logic
* App integration
[Diagram labels: Device driver, App]
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* Solarflare’s Application Onload Engine

* Practical acceleration for network applications
— Much less work to offload custom business logic
— Supports incremental deployment

* Shipping now

* First deployments being used for enterprise messaging and market data
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Mellanox 


TECHNOLOGIES


SwitchX 
Virtual Protocol Interconnect (VPI) 
Switch Architecture 


Server / Compute | Switch / Gateway


Virtual Protocol Interconnect
56G IB & FCoIB


10/40GbE & FCoE 


Virtual Protocol Interconnect 
56G InfiniBand 


10/40GbE
Fibre Channel
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= Fifth generation switching IC from Mellanox
= Virtual Protocol Interconnect (VPI) technology:
‘One-Wire’ fabric for InfiniBand, Ethernet and Fibre Channel traffic
= Provides Highest Capacity, Lowest Latency, Lowest Power consumption in the Industry
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SwitchX™ Performance and Configuration Flexibility 


PERFORMANCE
= 4Tb/s
= 36 x 40/56G
= 200ns Latency
= 40 Watts @ 64 10GE
= 55 Watts @ 36 40GE

[Diagram labels: VPI Switch, Unified Fabric Manager, Switch OS Layer]


* 64 ports 10GbE 
• 36 ports 40GbE 
* 48 10GbE + 12 40GbE 
* 36 ports IB up to 56Gb/s 
* 8 VPI subnets 


1U switch configuration options
= 36 Port FDR IB
= 36 Port 40GigE VPI IB/Ethernet
= 64 Port 10GigE VPI IB/Ethernet
= 12 Port 40GigE/48 Port 10GigE VPI

1U switches 


Blade switch configuration options
= 16 - 40GigE to servers
= 12 - 10GigE to LAN
= 8G FC to SAN/2 - 40GigE stacking ports

Modular switch chassis options
= Up to 648 56G IB ports
= Up to 648 40GigE ports
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Ethernet Switching Performance

= Throughput (2.5X)
• 2.88Tb/s throughput on a single chip, running Full Wire Speed at any packet size
= L2 UC/MC Latency for L2/L3 switches (2X)
• 198-223ns for any packet size
= L3 Latency (2X)
• 321-337ns for any packet size
= Power Efficiency (6X)
• Sub 0.6Watt per 10GbE throughput with 100% load at Full Wire Speed
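The headline throughput figures are consistent with counting every port at line rate in both directions. That counting convention is an assumption (the slides do not state it), and the function name is hypothetical; the port counts and speeds are taken from the slides.

```python
def aggregate_tbps(ports, gbps_per_port, directions=2):
    """Aggregate switching capacity in Tb/s, counting each direction of each port."""
    return ports * gbps_per_port * directions / 1000


ethernet_40g = aggregate_tbps(36, 40)   # 36 ports of 40GbE
infiniband_56g = aggregate_tbps(36, 56) # 36 ports of 56Gb/s IB
```

Under that assumption 36 x 40GbE gives 2.88 Tb/s, matching the Ethernet figure above, and 36 x 56Gb/s gives about 4.03 Tb/s, matching the 4Tb/s headline.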


[Chart: Latency (ns) versus Packet Size (Bytes) for 64, 128, 256, 512, 1,024, 1,280, 1,518, 2,176 and 9,216 byte packets; series L2 Min and L2 Average]


Feature Overview 


Physical Layer: SFI XAUI/RXAUI/XLAUI KR/KR2/KR4 - 10/20/40/56 GigE; IB SDR, DDR, QDR, FDR10, FDR - 10/20/40/56/100 Gb/s; 2/4/8 Gb/s Fibre Channel

L2/L3 Protocols: Ethernet, Datacenter Bridging, TRILL, QCN, Fibre Channel Forwarder, IP Routing, IB Routing

Bridges: Ethernet to Fibre Channel Gateways (NPIV); IB-Ethernet and IB-FC Gateways

QoS: 802.1p and DIFFSERV classification, marking; Access Control Lists (ACL, eACL)

Management: Dual SGMII; PCIe Gen2 x4, or PCIe x1 and Dual SGMII; GPIOs, I2C, JTAG

Other Features: Energy Efficient Ethernet, Active Power Governor; IEEE 1588 time stamping; 40nm process, 1924 pins in 45x45mm FCBGA 1 mm pitch


Converged I/O Solutions 


[Diagram labels: Converged Network Management; 2/4/8Gb/s FC Ports; 1/10/20/40 Gb/s Ethernet Ports; TRILL R-Bridge L2 Switching; IP Router]

= 2.8/4 Tb/s lossless switching
= 64 10GigE, 36 40GigE
= Flexible mix of ports
• i.e. 48 10GigE, 12 40GigE
= Multi-chip, high port count configurations
• Efficient cluster scaling
• Fat tree scaling
• Adaptive routing

= NPIV, FCF based native FC ports
• 2/4/8 Gb/s
• N, VN, F, VF, E, VE port types
• Soft and hard zoning
= Sample port configuration
• 40 10GigE, 24 8Gb/s FC
• 52 10GigE, 12 8Gb/s FC
• 24 40GigE, 24 8Gb/s FC
• 30 40GigE, 12 8Gb/s FC
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Virtual Protocol Interconnect (VPI) I/O Convergence

[Diagram: Converged I/O with Ethernet and Converged I/O with InfiniBand; labels include 10/20/40/56 Gb/s IB Ports, 10/20/40/56 Gb/s Ethernet Ports, 2/4/8 Gb/s FC Ports, IB Switching, IB Router, FC Forwarder, NPIV FC GW, Repurposing]
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Switch Partitioning 


= Up to 8 switch partitions can be activated
= Flexible number of Ethernet, FCoE, FC port assignments per switch partition
= Separate L2 data and control plane domains

Virtual Routers
= Multiple Virtual Routers
= Separate address space per VR
= Isolation and fault containment

[Diagram labels: 1/10/20/40 Gig Ethernet Ports, 2/4/8Gb/s FC Ports, Switch Partitions, TRILL R-Bridge L2 Switching, IP Router]
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Control Plane Multitenancy 


= Multiple switch partitions can be instantiated
• Like virtual switches inside physical switch
• Complements virtualized servers and storage
• Control/data separation like separate switches
= Flexible # of ports & personalities
• Per switch partition, e.g., IB, L2+ Eth, FC

[Diagram: Cloud Deployment With Switch Partitions; Customer A, Customer B and Customer C, each with Apps, Servers or VMs, Data & Switch Networks, and Storage]

Supports evolving cloud & multi-tenancy architectures
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Switch Port Virtu 


= Flexible VSP Allocation
• 8 VSPs on 36 ports
• 4 VSPs on 64 ports
= Hairpin Mode per VSP
= Switch Partition per VSP
= SVID Allocation

= IEEE 802.1Q
* Bridge Port Extension 
* VLANS 
* Traffic Prioritization 


VSP = Virtual Switch Port 
SVID = S VLAN ID 


Blade Switch Module using 
SwitchX 
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SwitchX Multi-chip Stacking 


= Chain and ring topologies
= Any port can be a stacking port
= Single point of management across stacking units (SU)
* Efficient Inband configuration over management datagrams (eMAD) 
= System resiliency 
• Any SU can take charge of the system 
* Alternate paths dynamically used when stacking link down 
= Cross system features 
* Link aggregation — ports across SUs in same LAG group 
• ACL — same policy to ports across SUs 
- e.g. VLAN ACL 
* Unified tables are populated on all SUs 
- e.g. L2 filtering DB, L3 routing tables 
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Mellanox 


Multi-chip Configuration - Topologies

[Diagrams: 2 Layers FAT-TREE (up to 1728 10GE ports); 2 Layers FAT-TREE, Single Spine Chip (up to 144 10GE ports); Back to Back (up to 96 10GE ports); links labelled 36 x 40G and 48 x 10GE]

= Non Blocking
= L2 and L2 Multicast forwarding
= Link Aggregation across fabric
= Port Mirroring across fabric
= Seamless class of service support
= Preserving VLAN membership


This slide does not present all possible configurations — but rather most reasonable multi-chip configuration topologies 
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Flexible Switch Management Interface 


= Single chip
• x4 PCIe or dedicated GigE network
= Multi chip
• x4 PCIe to inband fabric (Eth or IB) or dedicated GigE network

[Diagram labels: x4 PCIe, Dedicated GigE, GigE Module, Inband fabric (Eth or IB)]
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SwitchX Packet Flow Overview 


= Fully Pipelined Implementation
= Low-latency cut-through switching support
= Wire speed forwarding

[Diagram: Control Plane above a Data Plane; legible blocks: Packet Classification Engine, Ingress Policy Engine, L2 Forwarding Engine, Egress Policy Engine, Queuing And Scheduling, Packet Re-write, Data Buffer, Switch Crossbar]
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Forwarding Engine — Packet Flows 


[Diagrams: flows of Packet Payload, Packet Header and Stacking Control Data; legible blocks: 10GE / 40GE MAC, Data Buffer, L2 Engine, L3 Engine, Queuing and Scheduling, Stacking (40GE / 56GE) MAC]


ACLs Architecture Overview 


[Diagram: two protocol-aware ACLs and one non-protocol-aware ACL, each divided into regions of rules; legible labels include Region 3 (IPv6 Rules) and Region 7 (L2 Rules)]

Rule Binding - Nesting

[Diagram: same ACL layout; legible labels include Region 1 (Non-IP Rules), Region 4 (Non-IP Rules), Region 5 (IPv4 Rules), Region 6 (IPv6 Rules), Region 7 (L2 Rules), and a further non-protocol-aware ACL]

Rule Binding - Break

[Diagram: same ACL layout; legible labels include Region 7 (L2 Rules) and a further non-protocol-aware ACL]

Queuing and Scheduling

= 8 Traffic Classes
= ETS Scheduling
= Mirroring/Replication
= UC/MC Flows

[Diagram: Rate Limiters feed Group 0 through Group 7, each scheduled DWRR / Strict; the groups are combined by Strict / DWRR (ETS), alongside Strict (TCG15) and a Rate Limiter]

[Diagram: Control Plane; Packet Classification Engine, Ingress Policy Engine, Forwarding Engine, Egress Policy Engine, Packet Queuing And Scheduling, Rewrite]
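The slide names DWRR (deficit weighted round robin) without describing it. The sketch below is the textbook algorithm, in which each queue earns a quantum of bytes per round and may send packets while its accumulated deficit covers them. It is not a description of the SwitchX implementation, and the function and parameter names are hypothetical.

```python
from collections import deque


def dwrr(queues, quanta, rounds):
    """Serve packet queues with deficit weighted round robin.

    queues: list of deques holding packet sizes in bytes.
    quanta: bytes of credit each queue earns per round (its weight).
    Returns the service order as a list of (queue_index, packet_size).
    """
    deficits = [0] * len(queues)
    served = []
    for _ in range(rounds):
        for i, queue in enumerate(queues):
            if not queue:
                deficits[i] = 0  # idle queues do not bank credit
                continue
            deficits[i] += quanta[i]
            while queue and queue[0] <= deficits[i]:
                size = queue.popleft()
                deficits[i] -= size
                served.append((i, size))
            if not queue:
                deficits[i] = 0
    return served


queues = [deque([100] * 4), deque([100] * 4)]
order = dwrr(queues, quanta=[200, 100], rounds=2)
bytes_per_queue = [sum(s for q, s in order if q == i) for i in range(2)]
```

With quanta of 200 and 100 bytes, the first queue is served twice as many bytes as the second over the same rounds, which is how the weights translate into bandwidth shares.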


SwitchX TOR Portfolio 


= Capacity
• 36 40GbE ports
• 64 10GbE ports
• 48x10GbE+12x40GbE combo
• Various other port schemes via breakout cables
= Latency
• 220ns latency 40GbE
- 830ns L3 latency
• [latency figure illegible]
- 430ns L3 latency
= Throughput
• 2.88Tb/s of non-blocking throughput
= Key Features
• L2/L3 stack
• VPI
- 56GbE
• End to end solution
= Power
• Under 1W per 10GbE interface
• 2.3W per 40GbE interface
• 0.6W per 10GbE of throughput

© 2012 MELLANOX TECHNOLOGIES

= 648 x QSFP 40GE* ports
= 1152 x SFP+ 10GE* ports
= 51.84Tb/s throughput
= 9.6 Watt/40GE port
= Latency: 700ns inter line, 230ns same line
= World's first Cut-Through modular Ethernet switch
= N+N PS Redundancy
= L2/L3 SW Stack
= Same Chassis is used for IB FDR (56Gbps)
= Smaller Chassis (324p, 216p, 108p)
• Same leafs, spines, management boards
• Same architecture
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THE SURROUND COMPUTING ERA 


Mark Papermaster 
Senior Vice President and CTO, AMD
Hot Chips Symposium 


Cupertino, CA 
August 28, 2012 


AMD 


A RAPIDLY CHANGING ENVIRONMENT


її content anytime, any platform, anywhere 
Explosion of unstructured data 
245 exabytes of data crossed Internet in 2010¹


Growing to 1000 exabytes in 2015 


Data center server demand >10M units by 2016²


1. Cisco Visual Networking Index Global IP Traffic Forecast, 2010 to 2015 
2. Worldwide and Regional Server 2012-2016 Forecast, IDC, May 2012 
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OLUTIONARY TRANSFORMATION 


10 years ago


- Graphics acceleration enabled 
1puting accessible to everyone 


Starting now: The Surround Computing Era 

- Computers are everywhere 

- Integrating into our environment 

= Computing is part of everyday life, not a distinct activity 
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- Multi-platform = eyeglasses to room-size 
- Fluid — realistic output, natural human input
- Intelligent — anticipates our needs


Profound implications for computer architecture 


Smarter clients — realistic, natural human communication 
Smarter clouds — orchestrate 10B devices in real-time
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Natural UI and
Gestures 


Touch, gesture 
and voice 


Content 
Everywhere 


Content from any 
source to any display 
seamlessly 


Biometric 
Recognition 


Secure, fast, accurate: 
face, voice, fingerprints 


Beyond HD 
Experiences 


Streaming media, new 
codecs, 3D, 
transcode, audio 


Augmented 
Reality 


Superimpose graphics, 
audio, and other digital 
information as 
a virtual overlay 


AV Content 
Management 


Searching, indexing and 
tagging of video and 
audio. Multimedia 
data mining 


New Surround Compute Applications and Experiences — Accelerators Required! 
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AMD: 


= 


SMARTER CLOUDS 


The Cloud is 
the “Backbone” of 
Surround Computing 


Surround Computing Cloud Services 


» Trust Context 


Analytic 
Compute 
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Connected devi 


1 Scale — to support tens of billions of connected devices 
Acceleration — back-end NUI, graphics, analytics 


Security, privacy — consistent end-to-end architecture 
Real time - latency is critical 


Dense servers — optimized for low power 
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ORWARD 


[...] key client and server parallel workloads


Heterogeneous System Architecture (HSA) 
= New silicon architecture making it all work together 
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CHANGING THE THINKING, CHANGING THE GAME 


[...] in C, C++, Java, Python, JavaScript, HTML5
ISA agnostic 


GPU = CPU in terms of processing capability 
Full programming language features 
Shared virtual memory: pointer is a pointer 
Coherency and context switching 


HSA Foundation is an industry-wide initiative 
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BENEFITS OF HETEROGENEOUS SYSTEM ARCHITECTURE 


Easy to 
\ Program 
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HSA Accelerated 
Processing Unit 


APU is the breakthrough app enabler


APU enables рага! 


Data Parallel 
Workloads 


Emerging workloads require:
- Seamless execution across CPU/GPU
- Other specialized engines


Serial and Task
Parallel 
Workloads 


APU is the platform of choice 


AMD: 
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AMD “STEAMROLLER” CORE 


nes а core execution 
Push on performance/watt
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[Block diagram: “Steamroller”; legible labels: Fetch, Decode (x2), Integer Scheduler (x2), FP Scheduler, L1 DCache, Shared L2 Cache]
FEED THE CORES FASTER


£ 
De 


ficient dispatch 
e instruction pre-fetch 


To INT-1 


To FPU 


To ІМТ-0 


spatches per 
Thread? 


1. Based on AMD’s internal simulation results of average workloads of simulated performance on a number of tests, including those testing transaction processing. (Systems have to be publicly available to publish SPEC CPU Rate.)
2. Based on AMD’s internal simulation results of average workloads of simulated performance on a number of tests, including those digital media, productivity and gaming applications.
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In concert with feeding the core faster 
More register resources, same latency
More intelligent scheduling 


Design to decrease average load latency: 


Minimum latency is only part of story 
Faster handling of data cache misses 
Accelerate store-to-load forwarding 
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”: IMPROVING SINGLE-CORE EXECUTION 


Major improvements in store handling

5-10% increase in scheduling efficiency¹

[Block diagram; legible labels: Integer Scheduler (x2), FP Scheduler, L1 DCache]


“STEAMROLLER” PERFORMANCE/WATT DESIGN

Microarchitectural power optimization
[...] average dynamic power
Optimize for loop behaviors
Floating point rebalance
Streamlined execution hardware
Adjust to application trends
Adaptive mode based on workload
Dynamic resizing of L2 cache

[Block diagram: “Steamroller”; legible labels: Fetch, Decode (x2), Integer Scheduler (x2), FP Scheduler, 128-bit FMAC, MMX Unit, L1 DCache, Shared L2 Cache]
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— Lower energy consumption and utility bills 
- Lower data center TCO 


Multi-faceted attack beyond process technology 
= Optimize hardware with software applications 

= Intelligent on-die power management 

= Efficient design methodologies 
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AMD: 


ARCHITECTURAL EFFICIENCY EXAMPLE WITH VIDEO ENHANCEMENT

MotionDSP 720p

[Charts: Performance and Efficiency¹, CPU vs. CPU+GPU]

Synergistic use of GPU Compute + shared memory: lower power and higher performance

1. AMD E2-3200 APU (Llano 32nm, 2 cores @ 2400 MHz, GPU: 2 CU @ 444 MHz), Windows 7 OS, MotionDSP vReveal Applications (http://www.vreveal.com/stabilization)
17 | The Surround Computing Era | Hot Chips - August 2012


[Figure: three temperature plots]

AMD: 


Enabled by sophisticated on-die microcontroller and sensors 
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+ 


“Bulldozer” 

Part of the Floating Point Unit. Hand- 
drawn for maximum speed and density 
in 32nm 




With High Density Library 
The same blocks again, but rebuilt 
using a High-Density cell library to 
achieve 30% area and power 
reductions 


15-30% lower energy per operation¹ for power-constrained designs: same order as a full process node improvement


25 ПЕ EVERYTHING TOGETHER 


| interconnect fabrics are needed 


_ Optimally process unstrt 



t massive numbers of processors 


st possible overhead 
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AMD: 


AMD FREEDOM™ FABRIC TECHNOLOGY

AMD off-chip interconnect fabric IP
- Designed to enable significantly lower TCO
- Links hundreds to thousands of SoC modules
- Shares hundreds of TBs storage and virtualizes I/O
- 160Gbps Ethernet Uplink
- Instruction Set Architecture agnostic
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ivery of tailored solutions 


= Leveraging differentiated IP
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: T WAVE — SURROUND COMPUTING REVOLUTION 


к EY ID products will enable the transition 
_ HSA 
Ambidextrous 
Fast fabrics 
Relentless focus on power efficiency 


AMD inspired the interactive computing revolution 


Now leading the way to surround computing 
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THANK YOU 


AMD: 


The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences 
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or 
otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to 
time to the content hereof without obligation of AMD to notify any person of such revisions or changes. 


AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO 
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. 


AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN 
NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES 
ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF 
SUCH DAMAGES. 
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are trademarks of Advanced Micro Devices, Inc. Other names and logos are used for informational purposes only and may 
be trademarks of their respective owners. 
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AMD RADEON HD 7970 
WITH GRAPHICS CORE NEXT (GCN) 
ARCHITECTURE 


Mike Mantor, AMD Senior Fellow 
michael.mantor@amd.com 
August 28, 2012 


AMD 


GRAPHICS CORE NEXT ARCHITECTURE

• Product Goals
  - Time to Market
  - Maximize Performance/Watt
• Enable first class GPU compute
  - Simplify GPU programming
  - Improve GPU utilization
  - Provide predictable performance
• Parallel Graphics/Compute Architecture
  - New ISA & Compiler
  - Distributed Compute Units
  - Global Unified Read/Write Cache
  - Asynchronous Compute Engines (ACE)
  - Reliability improvements with ECC
• AMD Eyefinity Display Technology
  - Multiple Display Configurations
  - 3D Stereo Displays
  - Flexible Audio
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AMD RADEON" HD 7970 ARCHITECTURE 


Graphic Core Next (GCN) 
- 4.3 billion 28nm transistors 


[Block diagram: Radeon HD 7970 with its array of GCN Compute Units]
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AMD RADEON" HD 7970 ARCHITECTURE 


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Graphic Core Next (ОСМ) 
Advanced Power Management 


Fine grain clock/clock tree gating

PowerTune: Dynamic V/F Scaling with power containment

ZeroCore Power: Power Gating


AMD RADEON" HD 7970 ARCHITECTURE 


Graphic Core Next (GCN) 
32 Compute Units(CU) 


Non VLIW ISA 

Distributed Control Flow 
32/64b IEEE-2008 FP 
Integer, Logic & Video Ops 


4 Texture Units per CU

[Block diagram: 32 GCN Compute Units]
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AMD RADEON" HD 7970 ARCHITECTURE 
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Graphic Core Next (GCN) 
• 384-bit GDDR5 - 264 GB/s
• Unified R/W Cache Hierarchy
• 768KB R/W L2 Cache
• 16KB R/W L1 per CU
• 16KB Instruction Cache (I$) / 4 CU
• 32KB Scalar Data Cache (K$) / 4 CU


AMD RADEON" HD 7970 ARCHITECTURE 
Graphic Core Next (GCN) 
• PCI Express® Gen 3.0 x16
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AMD RADEON” HD 7970 ARCHITECTURE 


Graphic Core Next (GCN) 


• Global Data Share: 64 KB shared memory with global synchronization resources (barriers, append, ordered append and named semaphore resources)
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AMD RADEON” HD 7970 ARCHITECTURE 


Graphic Core Next (GCN) 
• Dual Geometry Engines
• Dual Rasterizers
• 8 Render Back-ends
• 32 Pixel Color Raster Operation Pipelines (ROPs)
• 128 Depth Test (Z)/Stencil Ops
• Color Cache (C$)
• Depth Cache (Z$)
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AMD RADEON” HD 7970 ARCHITECTURE 


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Graphic Core Next (GCN) 


Dual Asynchronous Compute Engines (ACE) and Dual DMA

Compute ECC protection (DRAM & SRAM: Registers, Shared Memories, L1 & L2 Caches)

GPU support for Compute APIs: OpenCL™ 1.2, DirectCompute, C++ AMP

Multi-Media and Display System

• AMD Eyefinity
  - Single 16k x 16k image across 6 displays
  - Drives three 3D stereo displays
  - Flexible bezel display
• Discrete Digital Multi-Point Audio
  - Multi-display video conferencing
  - Directional audio


«Flexible Bezel Comp 
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AMD RADEON” HD 7970 ARCHITECTURE 


Multi-Media and Display System 
Universal Video Decoder (UVD) 
Fixed Function with codecs for: 


• H.264
• VC-1
• MPEG-2 (SD & HD)
• MVC (Blu-ray HD)
• DivX®

• WMV МЕТ 

° WMV native 
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AMD RADEON” HD 7970 ARCHITECTURE 


Multi-Media and Display System 
Video Codec Engine (Fixed Function) 
• Multi-stream hardware H.264 HD Encoder
• Power efficient & faster than real-time 1080p @ 60fps
• Two encode modes: full fixed & hybrid (with GPU compute)
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AMD RADEON" HD 7970 ARCHITECTURE 


Multi-Media and Display System 
CrossFire™ Compositor 
• Controller for Multi-GPU Solutions
• Dual, triple or quad-GPU scaling
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GCN ARCHITECTURE SUPPORTS MULTIPLE PRODUCT CONFIGURATIONS 


• Memory Channels/L2 Partitions & I/O Pins
• Number of Compute Units and Number of Texture Units
• Vertex/Primitive/Pixel Rates
• 64b Floating Point Rates and ECC Options
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GCN Compute Unit 


[Block diagrams: GCN product configurations with different numbers of Compute Units]
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SPEEDS & FEEDS 


                                 Dec 2011                 Feb 2012                 March 2012               June 2012
                                 AMD Radeon™ HD 7970      AMD Radeon™ HD 7770      AMD Radeon™ HD 7870      AMD Radeon™ HD 7970 GHz Edition
Process                          28nm                     28nm                     28nm                     28nm
Transistors                      4.3 billion              1.5 billion              2.8 billion              4.3 billion
Engine Clock                     925 MHz                  1000 MHz                 1000 MHz                 1 GHz / 1.05 GHz
Primitive Rate                   2 prim / clk             1 prim / clk             2 prim / clk             2 prim / clk
Stream Processors                2,048                    640                      1,280                    2,048
Compute Performance (SPFP/DPFP)  3.79 TFLOPS / 947 GFLOPS 1.28 TFLOPS / 80 GFLOPS  2.56 TFLOPS / 160 GFLOPS 4.3 TFLOPS / 1.08 TFLOPS
Texture Units                    128                      40                       80                       128
Texture Fillrate                 118.40 GT/s              40.0 GT/s                80.0 GT/s                134.40 GT/s
ROPs / Pixel Fillrate            32 / 30.24 GP/s          16 / 16.0 GP/s           32 / 32.0 GP/s           32 / 33.60 GP/s
Z/Stencil                        128                      64                       128                      128
Memory Type                      3GB GDDR5                2GB GDDR5                2GB GDDR5                3GB GDDR5
Memory Width / Clock             384-bit / 1.375 GHz      128-bit / 1.125 GHz      256-bit / 1.2 GHz        384-bit / 1.5 GHz
Memory Data Rate (Gbps)          5.5 Gbps                 4.5 Gbps                 4.8 Gbps                 6.0 Gbps
Memory Bandwidth                 264 GB/s                 72 GB/s                  153.6 GB/s               288 GB/s
Typical Board Power              ~205W                    ~83W                     ~150W                    ~250W
AMD ZeroCore Power               <3W                      <3W                      <3W                      <3W
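
The Memory Bandwidth and single-precision Compute Performance figures in the Speeds & Feeds data can be cross-checked against the other rows. The following is a minimal sketch, assuming peak bandwidth is bus width in bytes times the per-pin data rate, and peak single-precision throughput is stream processors times 2 FLOPs (one multiply-add) times engine clock. The function names are illustrative and do not come from the presentation.

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bus width in bytes times per-pin data rate."""
    return bus_width_bits / 8 * data_rate_gbps


def peak_sp_tflops(stream_processors, engine_clock_ghz):
    """Peak single-precision TFLOPS, counting one multiply-add as 2 FLOPs."""
    return stream_processors * 2 * engine_clock_ghz / 1000


# (bus width in bits, data rate in Gbps, stream processors, engine clock in GHz),
# all taken from the Speeds & Feeds rows.
boards = {
    "HD 7970": (384, 5.5, 2048, 0.925),
    "HD 7770": (128, 4.5, 640, 1.0),
    "HD 7870": (256, 4.8, 1280, 1.0),
    "HD 7970 GHz Edition": (384, 6.0, 2048, 1.05),
}

for name, (width, rate, sps, clock) in boards.items():
    bandwidth = memory_bandwidth_gbs(width, rate)
    tflops = peak_sp_tflops(sps, clock)
    print(f"{name}: {bandwidth:.1f} GB/s, {tflops:.2f} SP TFLOPS")
```

Under these assumptions the computed values line up with the Memory Bandwidth and SPFP entries listed for each board.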
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GCN DISCRETE GPU FAMILY 


[Die images of the three GCN discrete GPUs, largest to smallest:]
4.3b transistors, 352 SQMM, 925 MHz, 3.78 Tflop, 32 CU / 32 Pix / 2 Tri, 384b, 5.5 Gbps, 264 GB/s
2.8b transistors, 212 SQMM, 1 GHz, 2.56 Tflop, 20 CU / 32 Pix / 2 Tri, 256b, 4.8 Gbps, 154 GB/s
1.5b transistors (77XX), 123 SQMM, 1 GHz, 1.28 Tflop, 10 CU / 16 Pix / 1 Tri, 128b, 4.5 Gbps, 72 GB/s
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ССМ COMPUTE UNIT 


• Basic GPU building block of unified shader system
• New instruction set architecture
  - Non-VLIW
  - Vector unit + scalar co-processor
  - Distributed programmable scheduler
  - Unstructured flow control, function calls, recursion, exception support
  - Un-typed, typed, and image memory operations
• Each compute unit can execute instructions from multiple kernels simultaneously
• Designed for programming simplicity, high utilization, high throughput, with multi-tasking

[Block diagram: Branch & Message Unit, Scheduler, Vector Units, Scalar Unit, Texture Filter Units (4), Texture Fetch Load / Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (4KB), L1 Cache (16KB)]
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ССМ COMPUTE UNIT (СИ) ARCHITECTURE 


Input Data: PC/State/Vector Register/Scalar Register

[Block diagram: Scalar Decode, Vector Decode, Vector Memory Decode, LDS Decode and Export/GDS Decode feed the Branch & MSG Unit, the Scalar Unit (8 KB registers, integer ALU), SIMD0 to SIMD3 (each with vector registers), the 64 KB LDS Memory, the Export bus, the R/W data L1 and the R/W L2. 4 CUs share a 16KB scalar read-only L1 and a 32KB instruction L1.]

http://developer.amd.com/afds/assets/presentations/2620_final.pdf
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LOCAL DATA SHARED MEMORY ARCHITECTURE 


[Block diagram: VecOp Decode and LDS Instruction Decode for SIMD 0/1 and SIMD 2/3; Input Buffering and Request Selection (1/2 WAVE/CLK); Input Address Cross Bar; Detection; banks 0 to 31; Read Data Cross Bar; Integer Atomic Units; Write Data Cross Bar]

Local Data Share (LDS) 64 KB, 32 banks with Integer Atomic Units

• Advantages
  - Low latency and bandwidth amplifier for lower power
  - Software managed cache
  - Software consistency/coherency: thread group via hardware barrier
• 64 KB, 32 bank Shared Memory
  - Direct mode: interpolation @ rate or 1 broadcast read 32/16/8 bit
  - Index mode: 64 dwords per 2 clks, service 2 waves per 4 clks
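
A banked shared memory like the 32-bank LDS serves requests fastest when the lanes of a wave touch different banks. The sketch below shows how an access pattern maps onto banks. It assumes dword-interleaved banking, that is bank = (byte address / 4) mod 32, and that lanes reading the same dword count as one access. The slide gives only the bank count, so the mapping and the helper names are illustrative assumptions.

```python
from collections import defaultdict

NUM_BANKS = 32   # from the slide: 64 KB, 32 banks
BANK_WIDTH = 4   # assumed: each bank serves one 32-bit dword per access


def bank_of(byte_address):
    """Bank index under the assumed dword-interleaved mapping."""
    return (byte_address // BANK_WIDTH) % NUM_BANKS


def conflict_degree(byte_addresses):
    """Largest number of distinct dwords that fall into a single bank."""
    dwords_per_bank = defaultdict(set)
    for address in byte_addresses:
        dwords_per_bank[bank_of(address)].add(address // BANK_WIDTH)
    return max(len(dwords) for dwords in dwords_per_bank.values())


unit_stride = [4 * lane for lane in range(32)]        # consecutive dwords
stride_two = [4 * 2 * lane for lane in range(32)]     # every second dword
stride_32 = [4 * 32 * lane for lane in range(32)]     # every 32nd dword

for name, pattern in [("unit", unit_stride), ("2", stride_two), ("32", stride_32)]:
    print(f"stride {name}: conflict degree {conflict_degree(pattern)}")
```

A conflict degree of 1 means every lane hits its own bank. Higher values indicate how many dwords would have to be serialized through the most contended bank under this model.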
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PREVIOUS VLIW SHADER ARCHITECTURE 
"Previous AMD GPUs used VLIW (Very Long Instruction Word) architecture 
- Combines instructions into a 4-wide VLIW that gets executed on a SIMD

[Diagram: shader instructions (e.g. c = d + e; a = e + f;) packed into the X, Y, Z and W slots of a VLIW instruction, per thread]

With dependencies between the shader instructions, some of the X, Y, Z and W slots are left idle.
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NEW NON-VLIW SHADER ARCHITECTURE 


• SIMD architecture without VLIW instructions
  - No need to combine instructions, since multiple threads can run in parallel

[Diagram: each shader instruction, with or without dependencies (e.g. a = b + c;), runs across all ALUs at once, one thread per ALU]
No idle ALUs due to no dependencies!

IS VLIW A GOOD LONG TERM SOLUTION?

[Diagram: VLIW4 SIMD and GCN Quad SIMD, lanes 0 to 15]

VLIW4 SIMD                                              GCN Quad SIMD
64 Single Precision multiply-add                        64 Single Precision multiply-add
1 VLIW Instruction x 4 ALU ops -> dependency limited    4 SIMDs x 1 ALU op -> occupancy limited
Specialized, complex compiler scheduling                Standardized compiler scheduling & optimizations
Compiler manages register port conflicts                No register port conflicts
Difficult assembly creation, analysis, and debug        Simplified assembly creation, analysis, and debug
Complicated tool chain support                          Simplified tool chain development and support
Careful optimization required for peak performance      Stable and predictable performance

VLIW packing sometimes requires domain transformation to achieve good utilization.
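
The "dependency limited" behaviour of VLIW packing can be illustrated with a toy packer. This is a minimal sketch, assuming a greedy in-order packer that closes a bundle whenever an instruction reads a register written earlier in the same bundle. It is not AMD's compiler algorithm, and the tuple format (dest, source, source) is made up for the illustration.

```python
def pack_vliw(instructions, width=4):
    """Greedily pack (dest, src, src) tuples, in order, into bundles of `width` slots."""
    bundles = []
    current, written = [], set()
    for dest, *sources in instructions:
        depends = any(src in written for src in sources)
        if depends or len(current) == width:
            # Close the bundle: a dependent op cannot share it, or it is full.
            bundles.append(current)
            current, written = [], set()
        current.append((dest, *sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


def utilization(bundles, width=4):
    """Fraction of VLIW slots that hold an instruction."""
    return sum(len(bundle) for bundle in bundles) / (width * len(bundles))


independent = [("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i"), ("j", "k", "l")]
dependent = [("a", "b", "c"), ("d", "a", "e"), ("f", "d", "g"), ("h", "f", "i")]

print(utilization(pack_vliw(independent)))  # all four ops share one bundle
print(utilization(pack_vliw(dependent)))    # each op needs its own bundle
```

With a chain of dependent operations every bundle carries a single instruction and the remaining slots sit idle, which is the case the non-VLIW design avoids by running one instruction across many threads instead.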
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CODE EXAMPLE 


float fn0(float a, float b)     //Registers r0 contains "a", r1 contains "b"
{                               //Value is returned in r2
    if (a > b)
        return ((a - b) * a);
    else
        return ((b - a) * b);
}

v_cmp_gt_f32 r0,r1              //a > b, establish VCC
                                //Save current exec mask
                                //Do "if"
v_sub_f32 r2,r0,r1              //result = a - b
v_mul_f32 r2,r2,r0              //result = result * a
                                //Do "else" (s0 & !exec)
v_sub_f32 r2,r1,r0              //result = b - a
v_mul_f32 r2,r2,r1              //result = result * b
                                //Restore exec mask

Generally straightforward to generate and understand ISA
VCC - Vector condition code
EXEC - Execution mask

Multi-threaded enables full vector unit utilization
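
The effect of the execution mask in this example can be modelled in a few lines. The sketch below is a software illustration only, assuming one list element per lane and Boolean lists standing in for VCC and EXEC. It follows the commented steps of the listing (compare, save mask, "if" side, "else" side, restore) and is not a model of the actual hardware or of the scalar instructions involved.

```python
def fn0_reference(a, b):
    """Scalar reference for the shader function."""
    return (a - b) * a if a > b else (b - a) * b


def fn0_wavefront(r0, r1):
    """Run fn0 on all lanes at once using an execution mask instead of branches."""
    lanes = len(r0)
    r2 = [0.0] * lanes
    exec_mask = [True] * lanes                      # all lanes active on entry

    # v_cmp_gt_f32: a > b, establish VCC for the active lanes
    vcc = [exec_mask[i] and r0[i] > r1[i] for i in range(lanes)]
    saved = list(exec_mask)                         # save current exec mask

    exec_mask = [v and s for v, s in zip(vcc, saved)]   # do "if"
    for i in range(lanes):
        if exec_mask[i]:
            r2[i] = r0[i] - r1[i]                   # v_sub_f32: result = a - b
            r2[i] = r2[i] * r0[i]                   # v_mul_f32: result = result * a

    exec_mask = [s and not e for s, e in zip(saved, exec_mask)]  # do "else"
    for i in range(lanes):
        if exec_mask[i]:
            r2[i] = r1[i] - r0[i]                   # v_sub_f32: result = b - a
            r2[i] = r2[i] * r1[i]                   # v_mul_f32: result = result * b

    exec_mask = saved                               # restore exec mask
    return r2


a_values = [3.0, 1.0, 2.0, 5.0]
b_values = [1.0, 4.0, 2.0, 0.5]
print(fn0_wavefront(a_values, b_values))
```

Both sides of the conditional are issued for the whole wave, and the mask decides which lanes keep each result, so no lane needs its own program counter.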


GCN SCALAR/VECTOR COMPUTE UNIT

• Simpler ISA compared to previous generation
  - No clauses and latency for transitions
  - No VLIW packing required
  - Control flow directly programmed (Exec mask control)
  - Complex control flow supported (example: non-uniform branch into loop)
• Scalar engine
  - Lower latency for distributed sequencer versus previous centralized
  - Reduces performance in previously clause bound cases
  - Reduces power handling of control flow op as control is closer
• Advanced language feature support
  - Exception support
  - Function calls
  - Recursion
• Enhanced extended ALU operations
  - Media ops
  - Integer ops
  - Integer atomic operations
  - Floating point atomics (min, max, cmpxchg)
• Enhanced debug support
  - HW functionality to improve debug support


29 | GCN | HotChips 2012 


R/W CACHE HIERARCHY

• 16KB instruction cache (I$) + 32KB scalar data cache (K$) shared per 4 CUs with L2 backing
• Each CU has 256KB registers and 64KB local data share
• L1 read/write 16KB write-through caches: 64 Bytes / CU / clock
• L2 read/write cache partitions (64KB/128KB), write-back caches: 64 Bytes / partition / clock
• Global data share facilitates synchronization between CUs

[Block diagram: CUs with L1 caches, L2 partitions and 64b dual-channel memory controllers]
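
The per-clock cache figures can be turned into aggregate bandwidth estimates. This is a rough sketch, assuming the caches run at the 925 MHz engine clock of the HD 7970 and that its 768KB L2 is split into six 128KB partitions (768 / 128). Neither assumption is stated on this slide, and the function name is illustrative.

```python
def aggregate_bandwidth_gbs(units, bytes_per_clock, clock_ghz):
    """Aggregate bandwidth in GB/s for `units` caches each moving bytes_per_clock."""
    return units * bytes_per_clock * clock_ghz


ENGINE_CLOCK_GHZ = 0.925      # assumed cache clock: HD 7970 engine clock
COMPUTE_UNITS = 32            # HD 7970 CU count
L2_PARTITIONS = 768 // 128    # assumed: 768KB L2 built from 128KB partitions

l1_total = aggregate_bandwidth_gbs(COMPUTE_UNITS, 64, ENGINE_CLOCK_GHZ)
l2_total = aggregate_bandwidth_gbs(L2_PARTITIONS, 64, ENGINE_CLOCK_GHZ)

print(f"L1 aggregate: {l1_total:.1f} GB/s")
print(f"L2 aggregate: {l2_total:.1f} GB/s")
```

Under these assumptions the L1 level offers several times the bandwidth of the L2, which in turn exceeds the 264 GB/s of the external memory interface.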
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GPU MEMORY MODEL 


• Relaxed memory model
  - All work-items within the same work group see the same L1 cache
  - Work-items of different work groups may use different L1 caches
  - All work-items and command streams use the same L2 cache
  - Command stream packets & shader instructions control data visibility
  - Sufficient primitives in the GPU hardware to implement the C++11 memory model
• GPU coherency
  - Acquire/release semantics control data visibility across the machine (compiler-controlled bit on load/store instructions)
  - L2 coherent: all CUs & CP can have the same view of memory
• Remote global atomics
  - Performed in L2 cache
  - Full set of integer ops and float max, min, cmp_swap
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AMD ССМ CU ARCHITECTURE SUMMARY 
= Heavily multi-threaded CU architected for 
throughput 


— Efficiently balanced for graphics and general 
compute 


— Simplified coding for performance, debug and 
analysis 


— Simplified machine view for tool chain 
development 


— Low latency flexible control flow operations 


— Read/Write Cache Hierarchy improves I/O 
characteristics 


— Flexible vector load, store, and remote atomic 
operations 


— Load acquire/store release consistency controls 
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REFERENCE 


http://www.amd.com/us/products/desktop/graphics/7000/7970/Pages/radeon-7970.aspx#/1

AMD Display Technologies whitepaper
AMD Eyefinity Technology whitepaper
AMD Power Technologies whitepaper
AMD Video Technologies whitepaper
Graphics Core Next Architecture whitepaper

http://developer.amd.com/afds/assets/presentations/2620_final.pdf
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CODE EXAMPLE 


float fn0(float a, float b)     //Registers r0 contains "a", r1 contains "b"
{                               //Value is returned in r2
    if (a > b)
        return ((a - b) * a);
    else
        return ((b - a) * b);
}

v_cmp_gt_f32 r0,r1              //a > b, establish VCC
                                //Save current exec mask
                                //Do "if"
                                //Branch if all lanes fail (optional)
v_sub_f32 r2,r0,r1              //result = a - b
v_mul_f32 r2,r2,r0              //result = result * a
                                //Do "else" (s0 & !exec)
                                //Branch if all lanes fail (optional)
v_sub_f32 r2,r1,r0              //result = b - a
v_mul_f32 r2,r2,r1              //result = result * b
                                //Restore exec mask

Optional branches: use based on the number of instructions in the conditional section. Executed in branch unit.

Generally straightforward to generate and understand ISA
Throughput optimized for vector instructions
Instruction types interleave within program
Optional scalar instructions jump fully predicated groups of instructions
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= Disclaimer & Attribution 


The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical 
errors. 


The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and 
roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing 
manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this 
information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to 
notify any person of such revisions or changes. 


NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED 
FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. 


ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO 
EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES 
ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH 
DAMAGES. 


AMD, the AMD arrow logo, AMD Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this 
presentation are for informational purposes only and may be trademarks of their respective owners. 


OpenCL is a trademark of Apple Inc. used with permission by Khronos. 
DirectX is a registered trademark of Microsoft Corporation. 

DivX is a registered trade mark of DivX Inc 

PCI Express is a registered trademark of PCI-SIG 


© 2012 Advanced Micro Devices, Inc. All rights reserved. 
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НОТ СНІР5 2012 


AMD “TRINITY” APU 
AMDA 


Sebastien Nussbaum 
AMD Fellow 
Trinity SOC Architect 


AMD APU "TRINITY" WITH AMD DISCRETE CLASS GRAPHICS
ALL NEW ARCHITECTURE FOR UP TO 50% GPU AND UP TO 25% BETTER X86 PERFORMANCE

"Piledriver" Cores
- Improved performance and power efficiency
- 3rd-Gen Turbo Core technology
- Quad CPU Core with total of 4MB L2

2nd-Gen AMD Radeon™ with DirectX® 11 support
- 384 Radeon™ Cores 2.0

HD Media Accelerator
- Accelerates and improves HD playback
- Accelerates media conversion
- Improves streaming media
- Allows for smooth wireless video

Enhanced Display Support
- AMD Eyefinity Technology
- 3 Simultaneous DisplayPort 1.2 or HDMI/DVI links
- Up to 4 display heads with display multi-streaming

[Block diagram: APU with "Piledriver" x86 cores, AMD HD Media Accelerator and Platform Interfaces, connected to DDR3 DIMMs, PCI Express x16, PCIe 4x1 and, over the Unified Media Interface, the A-Series Chipset (USB2, USB 3.0, SATA, VGA, LPC, HD Audio, SPI, CIR)]
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AMDA 


“TRINITY” FLOORPLAN 
32nm SOI, 246mm2, 1.303BN TRANSISTORS 


[Annotated die photo. Labelled blocks:]
- Dual Channel DDR3 Memory Controller
- AMD HD Media Accelerator (UVD, AMD Accelerated Video Converter)
- Unified Northbridge
- AMD Radeon™ GPU
- Up to 4 "Piledriver" cores with total 4MB L2
- HDMI, DisplayPort 1.2, DVI controllers
- PCI Express® I/O: 24 lanes, optional digital display interfaces


AMD 2nd GENERATION "BULLDOZER" CORE: "PILEDRIVER"

32nm "PILEDRIVER" COMPUTE MODULE
x86 CORE REDESIGN

Shared Fetcher / prediction pipeline - 64KB I-Cache
Shared 4-way x86 decoder
Shared Floating Point Unit - dual 128-bit FMA pipes
Shared 16-way 2MB L2
Dedicated integer cores
- Register renaming based on physical register file
- Unified scheduler per core
- Way-predicted 16KB L1 D-cache
- Out-of-order Load-Store Unit
ISA additions: FMA3, F16C
Lightweight profiling support in HW
- 14% improvement for desktop
- AMD Turbo Core 3.0
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“PILEDRIVER” CORE FLOOR PLAN 


[Annotated core floor plan. Legible labels: Branch Predictor, Instruction Decode, Integer Scheduler, Integer Datapath, 16KB L1]
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“PILEDRIVER” IMPROVEMENTS % ENHANCEMENTS VS. “BULLDOZER” 


[Annotated "Bulldozer" module diagram. Legible callouts:]
- 2x larger L1 TLB
- Hybrid predictor augmented with 2nd level predictor
- ISA [...]
- FPU and INT: faster instruction execution
- HW divider
- Improved store-to-load forwarding
- 60 entry [...] scheduler
- SYSCALL & SYSRET
- L2 efficiency and prefetching improvements

“PILEDRIVER” IMPROVEMENTS 


= Design optimized for wide 
operational range (0.8V to 1.3V) 


= 30% higher frequency at same 
voltage as “Stars” CPU Core in 
“Llano” 


= 50% more base product 
frequency vs. “Llano” at 
same 35W SOC TDP 


A10-4600M vs A8-3600M 
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* Loop Predictor 

* Way Predictor 

* Dispatch gating based on 
group size 

* Clock Gating 

* Reduction in high power flops 


* Intelligent L2 content tracking to speed up L2 flush
* State save/restore latency improvements to speed up power gating

MEDIA PROCESSING ACCELERATION
MEDIA PROCESSING ACCELERATION 


AMD'S UNIFIED VIDEO DECODER (UVD)

                        UVD 1st generation   UVD 2nd generation   AMD A-Series APU
H.264 / AVCHD           yes                  yes                  yes
VC-1 / WMV profile D    yes                  yes                  yes
MPEG-2                  -                    yes                  yes
MPEG-4 / DivX           -                    -                    yes
Bitstream decode        yes                  yes                  yes
Picture-in-Picture      -                    yes                  yes
Dual stream HD+SD       -                    yes                  yes

"TRINITY" ACCELERATED VIDEO CONVERTER ("AVC")

• Multi-stream hardware H.264 HD Encoder
• Power-efficient and faster than real-time 1080p @ 60fps
• 4:2:0 color sampling video
• Optimizations for scene changes (games and video)
• Variable compression quality
• Audio / Video multiplexing
• Input from frame buffer for transcoding and video conferencing
• Input from GPU display engine for wireless display
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VIDEO ENCODING SYSTEM OPERATION 


[Block diagram: within the APU, the current frame (uncompressed YUV420) and the reference frames in the group of pictures (GOP) pass through a forward transform (e.g. FDCT) and entropy encode to produce the H.264 compressed stream; bitrate feedback drives rate control on the x86 cores, alongside the Radeon GPU]


GPU DESIGN UPDATES FOR GAMING AND COMPUTE 


[Block diagram: Command Processor and 3D Engine]

• DirectX® 11 - SM 5.0, OpenCL™ 1.1, DC 11
• GPU Core made of 384 Radeon™ Cores, each capable of 1 SP FMAC per cycle
  - Organized as 96 stream processing units, each 4-way VLIW (vs. 5-way in Llano)
  - 6 SIMDs (each contains 16 processing units)
  - Each SIMD shares 1 texture unit, achieving 4:1 ALU:Texture rate
• 32 depth / stencil per clock, 8 color per clock
• 24x multi-sample and super sample, 16x anisotropic filtering
• Improved hardware tessellator vs. "Llano"
• Compute improvements
  - Asynchronous dispatch: multiple compute kernels with independent address space simultaneously
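
The GPU organization figures on this slide are mutually consistent, which a few lines of arithmetic can confirm. This is a minimal sketch: the core counts come from the slide, while the clock passed to `peak_sp_gflops` is a free parameter because this slide does not state a GPU clock. The function names are illustrative.

```python
def radeon_cores(simds, units_per_simd, vliw_width):
    """Total ALU lanes: SIMDs x processing units per SIMD x VLIW width."""
    return simds * units_per_simd * vliw_width


def peak_sp_gflops(cores, clock_mhz):
    """Peak single-precision GFLOPS with 1 FMAC (2 FLOPs) per core per cycle."""
    return cores * 2 * clock_mhz / 1000


cores = radeon_cores(simds=6, units_per_simd=16, vliw_width=4)
print(cores)                        # total Radeon cores
print(6 * 16)                       # stream processing units
print(peak_sp_gflops(cores, 1000))  # peak at a hypothetical 1000 MHz clock
```

The product of 6 SIMDs, 16 processing units and 4 VLIW slots gives the 384 Radeon cores quoted on the slide, and 6 x 16 gives the 96 stream processing units.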
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АМОЙ 


PERFORMANCE ACHIEVEMENTS 


PERFORMANCE INCREASE ON CLIENT WORKLOADS (FOR 35W TDP)
"TRINITY" VS. "LLANO" CPU PERFORMANCE INCLUDING POWER MANAGEMENT, FREQ AND IPC GAINS

[Bar chart: Digital Media, x-axis 0% to 60%. Benchmarks: Sony Vegas, PowerDirector 9 HD Transcode, Handbrake HD, Handbrake DVD, iTunes, Quicktime Pro, Photoshop Elements 2.0]

[Bar chart: Web & Productivity: Compression & Cryptography, x-axis 0% to 70%. Benchmarks: PCMark7 Score, PCMark7 Productivity, Windows Send & Compress, 7Zip, GUIMark2 HTML5, GUIMark2 Bitmap]

Experimental setup: see footnote 10
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AMDA 


PERFORMANCE AND POWER COMPARISON VS PRIOR-GENERATION 


АМОЯ 


[Bar charts, Trinity vs. Llano:]
Visual Performance - 3DMark Vantage Performance (see footnote 11)
Battery Life Hours - Windows Idle (est. 62 Whr battery; see footnotes [...] and 8 for battery life measurement considerations)
General Performance - PCMark Vantage Overall (see footnote 11)
Compute Capacity - Calculated CTP SP GFLOPS (see footnote 9)

Trinity performance based on estimates and/or preliminary benchmarks and are su[...]
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АМОЙ 


AMD TURBO CORE 3.0 TECHNOLOGY

AMD TURBO CORE 3.0 TECHNOLOGY: OVERVIEW
UTILIZE CALCULATED AVAILABLE DYNAMIC THERMAL HEADROOM TO IMPROVE PERFORMANCE
• 10-20 °C variations across die during peak load

[Thermal maps over the Die X and Die Y dimensions, showing Tjmax and the CPU0 and GPU temperature deltas for two workloads:]
GPU-dominated workload (3DMark®) with single thread application on CPU: CPU0 = 2.7W, GPU = 23.9W
CPU-dominated workload (Livermore Loop, 1 thread): CPU0 = 17W, GPU = 4.2W

Simulation results for engineering discussion; no claims made to applicability to specific configuration of sold products.


= Chip divided into “Thermal Entities” (TE)
— Power and thermal density are calculated per Thermal Entity


= Thermal RC network
— Transfer coefficients that describe thermal transfer between TEs, substrate and package are characterized
— Numerical analysis firmware runs on the management processor, which calculates per-TE temperatures
— TEs are throttled using voltage/frequency adjustments according to workload heuristics


[Figure labels: Heat Sink, Socket]
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AMD TURBO CORE 3.0 TECHNOLOGY: CALCULATED VS. MEASURED TEMPERA 


Legend: CPU module 0 (calculated), CPU module 1 (calculated), GPU (calculated), hotspot (measured)


Estimated +/- 3-5 °C difference in calculated hotspot vs. measured hotspot temperature, at steady thermal state


Measured hotspot temperature 


Experimental results for engineering review, no observable product functional operational difference results from thermal differences. No claims made to accuracy 
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AMD TURBO CORE 3.0 TECHNOLOGY - PERFORMANCE 


= Workloads of moderate activity have high residency at maximum frequency 


— Thermal headroom allows hotspot to remain below maximum control temperature 


= Higher activity workloads offer fewer opportunities to raise frequency and benefit from 
intelligent algorithms to bias power levels between CPU and GPU 


- Collaborative or compute CPU/GPU applications 
— Multi-threaded workloads 


[Chart: Trinity/Llano Client Performance vs. TDP; y-axis 0% to 40%; x-axis 17W, 25W, 45W, 65W, 100W]
Power Management gains increase at low power


Setup information: see footnote 12
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АМОЙ 


HARDWARE IMPROVEMENTS FOR LOW POWER 


SYSTEM OPTIMIZATION POINTS 
LEAP IN LOW POWER DESIGN 


= Idle — blank screen — system on 
= MM07 — Mobile Mark 07
= Media playback — user experience 


= Performance computing / gaming 


— “Trinity” increases performance within fixed 
cooling solution 

— Trinity’s significantly higher performance 
results in lower energy consumed for fixed 
amount of work or frames rendered, but higher 
power consumption during work 
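
The trade-off in the last bullet (more power while working, less energy per job) follows from energy = average power × time. A minimal sketch with made-up numbers; the wattages and run times below are hypothetical and are not taken from the slides:

```python
def energy_joules(avg_power_w: float, seconds: float) -> float:
    """Energy consumed by a job that draws avg_power_w for the given time."""
    return avg_power_w * seconds


# Hypothetical figures for illustration only (not AMD measurements).
slower_part = energy_joules(avg_power_w=20.0, seconds=100.0)  # lower power, longer run
faster_part = energy_joules(avg_power_w=25.0, seconds=70.0)   # higher power, shorter run

print(f"slower part: {slower_part:.0f} J, faster part: {faster_part:.0f} J")
```

With these assumed inputs the faster part draws more power during the work but finishes early enough to use less total energy.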
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Avg Power (W) 
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16.000 Average Power (APU+FCH) 
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E "Llano" 
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Idle ММ07 Playback 3DMark06 


AMD A10-4600M APU on AMD “Pumori” reference board, 2x2GB DDR3-1600, SSD C300, Windows® 7
64bit, Catalyst™ 8.941 vs. A8-3600M, 2x2GB DDR3-1600, SSD C300, Windows 7 64bit, Driver
8.941. Testing done at 1366x768. See footnotes 4 and 8 for battery life considerations.
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“TRINITY” FINE-GRAIN POWER САТІМС ISLANDS (2) 
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AMD HD Media Accelerator 

(UVD, AMD Accelerated Video Converter)

Graphics Memory Controller
(Graphics optimized memory request scheduler)
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Module 1 
Independent on-die power gated islands

Display Controller
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“TRINITY” FINE-GRAIN POWER GATING ISLANDS (3) 


[Figure: die diagram of independent on-die power gated islands: Module 1, UVD, VCE, Misc, Graphics SIMD arrays, AMD HD Media Accelerator (UVD, AMD Accelerated Video Converter), Graphics Memory Controller (graphics optimized memory request scheduler)]


Additional power-down region when all graphics functions are shut down
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SMART NORTHBRIDGE OPERATIONAL 5ТАТЕ5 




= Goals 
— Intelligent selection of DRAM and Northbridge frequency to meet performance and power needs 
— Additional power savings from reduced DDR termination and drive strengths at low DDR speeds 


= Design supports low-latency transitions between several operational V/F points 
— 4 Northbridge frequencies, 2 DRAM clock rates 


= Intelligent frequency selection based on performance needs from CPU / GPU and Multi-media 
— Memory intensive workloads and certain multi-media content types trigger switch to higher DDR speed 
— CPU intensive workloads switch to higher Northbridge frequency to improve latency to memory
— Multi-media buffers store real-time data during low-latency switching (less than 10 µs)


= Frequency selection is further optimized by 
— Static user policy selection of battery or performance optimization 
— OS power management hints 
— Heuristics to ensure higher voltage and frequency will not result in additional work throttling 
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THANK YOU ! 


AMD “TRINITY” APU 


= Core redesign for greater Performance 


= Audio and Video enhancements for the best 
media experience 


= Improved GPU performance with Radeon™ 
Cores 2.0 


= Low Power Leadership 
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FOOTNOTES 


1. 
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Testing performed by AMD Performance Labs. The score for the 2012 AMD A10-4600M on the “Ритогі” reference design for PC 
Mark ® Vantage Productivity benchmark shows an increase of up to 29% over the 2011 AMD A8-3500M on the “Torpedo” 
reference design. The AMD A10-4600M APU has a score of 6125 and the 2011 AMD A8-3500M APU scored 4764. 


2. Projections and testing developed by AMD Performance Labs. Projected score for the 2012 AMD Mainstream Notebook
Platform “Comal” the “Pumori” reference design for 3D Mark ® Vantage Performance benchmark is projected to increase by up
to 50% over actual scores from the 2011 AMD Mainstream Notebook Platform “Sabine.” Projections were based on AMD
A8/A6/A4 35W APUs for both platforms.


3. AMD Eyefinity technology works with games that support non-standard aspect ratios, which is required for spanning across
multiple displays. To enable more than two displays, additional panels with native DisplayPort™ connectors, and/or
DisplayPort™ compliant active adapters to convert your monitor’s native input to your card's DisplayPort™ or
Mini-DisplayPort™ connector(s), are required. AMD Eyefinity technology can support up to 6 displays using a single enabled AMD
Radeon™ GPU with Windows Vista® or Windows® 7 operating systems.


4. Testing and calculations by AMD Performance Labs. Battery life calculations based on average power on multiple benchmarks
and usage scenarios. These include Active metric using FutureMark® 3DMark ’06 (172 min./2:54 hours), streaming YouTube
video (271 min./4:30 hours), playback of a Microsoft sample clip from local HDD (303 min./5:03 hours), PowerMark ®
Productivity benchmark/radio off (483 min./8:03 hours), web browsing test was average of 40 minutes via 802.11n WLAN, 2
minutes per page using the web test tool developed by AMD (570 min./9:30 hours) and Windows ® Idle (725 min./12:05 hours)
as a resting metric. All battery life calculations are based on using a 6 cell Li-Ion 62.16Whr battery pack at 98% utilization for
Windows ® Idle, PowerMark ® and 96% utilization for 3DMark ® 06 workload, video playback and YouTube video streaming;
and 92% utilization for Blu-ray playback.
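
The battery figures above can be related to average platform power. A minimal sketch, assuming battery life = usable energy / average power; the capacity (62.16 Whr), utilization (98%) and Windows Idle runtime (725 min) come from the footnote, while the formula and the function names are illustrative assumptions rather than AMD's stated method:

```python
def battery_life_minutes(capacity_whr: float, utilization: float, avg_power_w: float) -> float:
    """Runtime in minutes for a pack drained at a constant average power."""
    return capacity_whr * utilization / avg_power_w * 60.0


def implied_average_power_w(capacity_whr: float, utilization: float, minutes: float) -> float:
    """Average power that would produce the quoted runtime."""
    return capacity_whr * utilization / (minutes / 60.0)


# Windows Idle case from the footnote: 62.16 Whr pack, 98% utilization, 725 minutes.
idle_power_w = implied_average_power_w(62.16, 0.98, 725)
print(f"implied average idle power: {idle_power_w:.2f} W")
```

Under these assumptions the idle case corresponds to roughly 5 W of average platform power.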


5. Projections and testing developed by AMD Performance Labs. The AMD A10-5800K APU with AMD Radeon™ HD 7660D
graphics, versus an AMD A8-3850 APU, with 14% uplift on x86 performance as measured in PCMark® 7 Productivity, and 30%
planned uplift on graphics performance using 3DMark ® 11 (P). All systems using "Trinity" 100W APU, 8GB DDR3-1600
memory, Windows ® 7 64 bit.


FOOTNOTES (2) 


6. 


10. 


11. 


12. 


13. 
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Based on AMD internal testing of video encoding speed of VCE of 1080p H.264 video at 47 seconds, which is faster than the 65 
second size of the 480p-kid.mov video file. System configuration: OS: Windows ® 7 64-bit, CPU: AMD A10-5800K with AMD 
Radeon™ HD 7660D graphics, Annapurna reference board, 8GB DDR3-1600, Windows ® 7 64bit. 


7. AMD Wireless Display technology provides the ability to wirelessly display local screen content onto a remote screen in real
time. Compliant receiver equipment required.


8. Testing and projections conducted by AMD performance labs. Testing on the 2011 AMD Mainstream Notebook Platform show
663 minutes (11.05 hours) of Windows ® Idle as “resting” battery life. Projections for the 2012 AMD Mainstream Platform
“Comal” show 748 minutes (12.47 hours) of Windows ® Idle as “resting” battery life.


9. GFLOPs calculations developed by AMD performance labs measuring compute capacity for the 2012 VISION A10-based
notebook which scored 603 GFLOPS. AMD GFLOPs calculated using GFLOPs = CPU GFLOPs + GPU GFLOPs = CPU Core Freq.
× Core Count × 8 FLOPS + GPU Core Freq. × DirectX® 11 capable Shader Count × 2 FLOPS.
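
The GFLOPs formula in the footnote can be written out directly. A minimal sketch; the formula is the one quoted above, but the clock frequencies, core count and shader count in the example are hypothetical placeholders, since the footnote does not list the inputs behind the 603 GFLOPS score:

```python
def apu_gflops(cpu_freq_ghz: float, cpu_cores: int,
               gpu_freq_ghz: float, gpu_shaders: int) -> float:
    """GFLOPs = CPU freq x cores x 8 FLOPS + GPU freq x shaders x 2 FLOPS."""
    cpu_gflops = cpu_freq_ghz * cpu_cores * 8
    gpu_gflops = gpu_freq_ghz * gpu_shaders * 2
    return cpu_gflops + gpu_gflops


# Hypothetical inputs for illustration only (not the A10 configuration).
example = apu_gflops(cpu_freq_ghz=2.0, cpu_cores=4, gpu_freq_ghz=0.5, gpu_shaders=384)
print(f"{example:.0f} GFLOPS")
```

With frequencies in GHz the result is already in GFLOPS, and the GPU term dominates for any realistic shader count.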


10. Experimental results on A10-4600M with Radeon HD7660G Graphics (“Trinity”) vs. A8-3500M 4GB DDR3-1600 with Radeon
HD6620G Graphics (“Llano”) 4GB DDR3-1333 - running under Windows ® 7 Ultimate, with Hitachi HDD 5400RPM


11. Projections and testing developed by AMD Performance Labs. Projected scores for the 2012 AMD Mainstream Notebook
Platform “Comal” the “Pumori” reference design for 3D Mark ® Vantage Performance, PCMark ® Vantage over actual scores
from the 2011 AMD Mainstream Notebook Platform “Sabine”. Projections were based on AMD A10/A8/A6/A4 35W APUs.


12. AMD A10-4600M APU with Radeon(tm) HD Graphics, 4GB DDR3-1600, on Pumori Reference Board with Hitachi 5400 RPM HDD.


13. Power measured by AMD Perf Labs on “Trinity” A0 silicon running SPECint ® 2006 on Pumori Reference board, and on Orochi
B0 (which contains “Bulldozer” Core) at same voltage and frequency. 20% dynamic power improvement was offset for
caching structure differences and leads to an estimate of more than 10% dynamic power reduction directly attributable to the
Core on SPECint 2006. Frequency improvement vs. “Stars” Core measured by AMD PEO for nominal process targeting on
“Llano” Rev. B0 and “Trinity” Rev. A1.
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DISCLAIMER & ATTRIBUTION 




The information presented in this document is for informational purposes only and may contain technical 
inaccuracies, omissions and typographical errors. 


The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to 
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences 
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or 
otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to 
time to the content hereof without obligation to notify any person of such revisions or changes. 


NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY 
IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. 


ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY 
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR 
OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF 
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 


AMD, the AMD arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this 
presentation are for informational purposes only and may be trademarks of their respective owners. 


AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other 
jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners. 
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APPENDIX 


AMD TURBO CORE 3.0 TECHNOLOGY - DIE THERMAL SIMULATIONS


= Thermal simulations for a 35W product


— 10-20°C variations across the die depending on the workload, during peak activity
— Hotspot needs to be controlled to maximum junction temperature


— Hotspot thermal simulations are now a critical part of the performance optimization flow


[Figure: simulated die temperature maps (Die X dim vs. Die Y dim), with the CPU0 and GPU hotspots at Tjmax]
CPU-dominated workload (Livermore Loop 1 Thread): CPU0=17W, GPU=4.2W
GPU-dominated workload (3DMark®) with single thread application on CPU: CPU0=2.7W, GPU=23.9W


Simulation results for engineering discussion — no claims made to applicability to specific configuration of sold products.
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PLATFORM POWER SAVINGS : AMD SERIAL VOLTAGE INTERFACE 2.0 


= SVI is the interface which allows the processor to communicate information to and from the voltage regulator
= SVI2 enables quicker power state transitions
— Faster data transmission rates (33 MHz)
— New regulator response when transition is complete
— 80+% improvement in 500mV set point change latency


= Power efficiency features
— Multiple Power State Indicators sent to regulator
  = PSI0 — Current low enough that regulator can shed phases
  = PSI1 — Current low enough that regulator can use pulse skipping / diode emulation
— Load Line trim, offset
  = Ability to adjust DC offset and load line slope based on APU state


[Chart: Regulator Efficiency vs. Load; Efficiency (%) against Load Current (A); curves for 1 phase regulation and 2 phases regulation]
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“TRINITY” DESIGN FOR LOW POWER — NEW FEATURES 


Display Power Optimizations
* Static-screen display refresh from single DRAM channel
* On-die cursor caching
* Increased on-chip buffering of display memory

Power Tuning
* Voltage and frequency are automatically selected using indication from GPU Power state, PCIe ® speed, Multimedia workload
* Dynamic DRAM speed - reduced power when bandwidth requirements are low
* SVI-2 Voltage Regulator interface - selection of optimal regulator power state depending on load

Power Gating, low voltage I/O
* UNB Power Gated when idle
* GPU Core support per-SIMD power gating
* PCI-Express and Display PHY Power Gating
* Accelerated Video Converter Power down
* Support for 1.25V DDR3 Memory

Graphics and Multimedia
* Video Compression Engine - offload engine to save encoding power
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AMD TURBO CORE 3.0 TECHNOLOGY : TEMPERATURE CALCULATION 


= CPU/GPU Temperature
— Firmware regularly calculates instantaneous temperature for each TE from a new power estimate and the prior temperature
— Uses a 5 stage thermal RC ladder
= Other silicon contributors
— High Speed IO interfaces, Northbridge are modeled as power and/or temperature offsets to simplify calculations
— This has limited impact on accuracy
= Measured error of +/-5 °C on 3DMark analysis


= Algorithm provides deterministic operation and reproducibility of results
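
The update described on this slide (new temperature from a power estimate and the prior temperature, through a 5 stage RC ladder) can be sketched as an explicit time step over a chain of thermal resistances and capacitances. This is an illustrative model only, not AMD's firmware: the function names, the R/C values, the ambient temperature and the time step are all assumptions chosen so the example is stable and easy to check.

```python
def step_ladder(temps, power_w, r, c, t_ambient, dt):
    """Advance an N-stage thermal RC ladder by one explicit Euler step.

    temps[i] is the temperature of stage i; r[i] is the resistance from
    stage i to stage i+1 (the last one connects to ambient); c[i] is the
    heat capacity of stage i. Power is injected at stage 0.
    """
    n = len(temps)
    new_temps = []
    for i in range(n):
        heat_in = power_w if i == 0 else (temps[i - 1] - temps[i]) / r[i - 1]
        next_temp = temps[i + 1] if i + 1 < n else t_ambient
        heat_out = (temps[i] - next_temp) / r[i]
        new_temps.append(temps[i] + dt * (heat_in - heat_out) / c[i])
    return new_temps


def simulate(power_w, r, c, t_ambient, dt, steps):
    """Start at ambient and apply a constant power for the given number of steps."""
    temps = [t_ambient] * len(r)
    for _ in range(steps):
        temps = step_ladder(temps, power_w, r, c, t_ambient, dt)
    return temps


# Assumed values for illustration only: K/W, J/K, degrees C, seconds.
R = [0.2, 0.3, 0.5, 0.8, 1.0]
C = [0.01, 0.02, 0.05, 0.1, 0.2]
final = simulate(power_w=10.0, r=R, c=C, t_ambient=45.0, dt=0.0005, steps=20000)
print([round(t, 2) for t in final])
```

Each step uses only the previous temperatures and the current power, so identical inputs always give identical results. At steady state the first stage settles at ambient plus power times the sum of the resistances.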
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UNIFIED NORTHBRIDGE AND MEMORY 


[Diagram labels: Dual Core Module 0, Dual Core Module 1, System Request Interface (SRI), Crossbar (XBAR), Control Link, Radeon™ Memory Bus, memory controllers]


= Unified Northbridge
— First UNB for APUs featured on “Trinity”
— Supports:
  = Interface to a Graphics Memory Controller
  = Two DDR2/3 interfaces, shared with the Graphics Northbridge via Radeon Memory Bus
  = APU Power Management
= Memory Support
— 128-bit interface arranged as two un-ganged 64-bit channels
— Supports Memory P-states, with memory speed changes on the fly
— Supports 1.25V DIMMs
= Up to 29.8 GB/s with DDR3-1866
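
The peak bandwidth figure follows from the interface width and the transfer rate. A minimal sketch, assuming peak bandwidth = transfers per second × bytes per transfer and a nominal 1866 MT/s for DDR3-1866; the function name is illustrative:

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# 128-bit interface (two 64-bit channels) at DDR3-1866.
ddr3_1866 = peak_bandwidth_gb_s(1866, 128)
ddr3_1600 = peak_bandwidth_gb_s(1600, 128)
print(f"DDR3-1866: {ddr3_1866:.2f} GB/s, DDR3-1600: {ddr3_1600:.2f} GB/s")
```

The DDR3-1866 result lands just under 29.9 GB/s, consistent with the slide's "up to 29.8 GB/s".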
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RADEON СОНЕ 5 2.0 DESIGN 


= VLIW4 thread processors 
— 4-way co-issue 
— All stream processing units now have 
equal capabilities 
— Special functions (transcendentals) occupy 3 of 
4 issue slots 


= Allow better utilization than previous 
VLIW5 design 
— Improved performance/mm²
— Simplified scheduling and register 
management 
— Extensive logic reuse 
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Stream Processing Units 


FP ops per clock                 | Integer ops per clock
4 32-bit FMA, MAD, MUL or ADD    | 4 24-bit MAD, MUL or ADD
2 64-bit ADD                     | 4 32-bit ADD or bitwise ops
1 64-bit FMA or MUL              | 1 32-bit MAD or MUL
1 Special Function               | 1 64-bit ADD


DISPLAY TECHNOLOGY LEADERSHIP 


DP1.2


Display outputs 1, 2, 3: DP / VGA / DVI / HDMI each


Stereo 3D
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Intel® Xeon Phi™ coprocessor 
(codename Knights Corner) 


George Chrysos 
Senior Principal Engineer 
Hot Chips, August 28, 2012 
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Intel® Many Integrated Core (Intel МІС) Architecture 


Targeted at highly parallel HPC workloads 
* Physics, Chemistry, Biology, Financial Services 


Power efficient cores, support for parallelism 
* Cores: less speculation, threads, wider SIMD 
* Scalability: high BW on die interconnect and memory 


General Purpose Programming Environment 
* Runs Linux (full service, open source OS)
* Runs applications written in Fortran, C, C++, ...
* Supports x86 memory model, IEEE 754
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Knights Corner — Power Efficient 


Performance per Watt of a prototype Knights Corner Cluster 
compared to the 2 Top Graphics Accelerated Clusters 
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Knights Corner Micro-architecture 
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Vector Processing Unit 
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Vector ALUS 


16 Wide x 32 bit 
8 Wide x 64 bit 


Fused Multiply Add 


Interconnect 


BL - 64 Bytes 


L2 | L2 
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BL — 64 Bytes 


Distributed Tag Directories 


TAG Core Valid MaskState 


TAG Core Valid MaskState 


Interleaved Memory Access 
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Interconnect: 2Х AD/AK 


BL - 64 Bytes 
Core Core 


BL — 64 Bytes 
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Multi-threaded Triad — Saturation for 1 AD/AK Ring 
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Multi-threaded Triad — Benefit of Doubling AD/AK 
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Streaming Stores 


Streams Triad
for (i = 0; i < HUGE; i++)
    A[i] = k*B[i] + C[i];


Without Streaming Stores 
Read A, B, C, Write A 
256 Bytes transferred to/from memory per iteration 


With Streaming Stores 
Read B, C, Write A 
Bytes transferred to/from memory per iteration 
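
The 256-byte figure is consistent with four 64-byte cache line transfers (read A, B, C and write A). A minimal sketch of that accounting, assuming a 64-byte line as shown on the interconnect slides and counting traffic per cache line of the loop; the streaming-store value is computed here under those assumptions, not quoted from the slide:

```python
CACHE_LINE_BYTES = 64  # assumption, matching the "64 Bytes" data width on earlier slides


def triad_traffic_bytes(streaming_stores: bool) -> int:
    """Memory traffic for one cache line's worth of A[i] = k*B[i] + C[i]."""
    # Without streaming stores, the line of A is read before it is written.
    reads = ["B", "C"] if streaming_stores else ["A", "B", "C"]
    writes = ["A"]
    return (len(reads) + len(writes)) * CACHE_LINE_BYTES


without_ss = triad_traffic_bytes(streaming_stores=False)
with_ss = triad_traffic_bytes(streaming_stores=True)
saving = 1 - with_ss / without_ss
print(without_ss, with_ss, f"{saving:.0%} less traffic")
```

Skipping the read of A removes one of four line transfers, so memory traffic for the triad drops by a quarter in this model.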


Multi-threaded Triad — with Streaming Stores 


Silicon Data 
Streaming > 30% 


Performance 


Cache Hierarchy Micro-architecture Choices 
L2 TLB 
64 entry, holds PTEs and PDEs vs. no L2 TLB


Dcache Capability 
Simultaneous 512b load and 512b store vs. 1 load or store per cycle 


L2 Cache 
512 KB vs. 256 KB 


Hardware Prefetcher 
16 stream detectors, prefetch into the L2 vs. no HWP (rely only on software prefetching) 


Per-Core ST Performance Improvement (per cycle) 


Spec FP 2006 


Performance impact of KNC core uArch improvements 


Caches — For or Against? 


Legend: Relative BW, Relative BW/Watt


L2 Cache BW 
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Example: Stencils 
spatial time-step simulation of a physical system 


and performance/ 
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Power Management: All Оп and Running 
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Core С1: Сіоск Gate Core 
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Core C6: Power Gate Core 
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Timeout when all cores have been in C6, 
clock gate the L2 and interconnect 
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Package C6 
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Summary 


Intel® Xeon Phi™ coprocessor provides: 




Performance and Performance/Watt for highly parallel HPC
with cores, threads, wide-SIMD, caches, memory BW 


Intel Architecture 
general purpose programming environment 
advanced power management technology 


Thank You 


Knights Corner brought to you by: 
IAG (Intel Architecture Group) 
* DCSG (Data Center and Systems Group)
* VPG (Visual and Parallel Group) MIC 
— HW Architecture 
— HW Design 
— SW 
SSG (Software and Services Group) MIC 


Vector Processor: 512b SIMD Width 


Shared Multiplier 
Circuit for SP/DP 


2:1 Ratio


Gather/Scatter Address Machinery 
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loop: gather-step; jump-mask-not-zero loop 


[Figure: gather/scatter address generation. A scalar register base address is added to each of Index0 through Index7 to produce Addr0 through Addr7; the access address is sent to the TLB/DCACHE.]
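
The "gather-step; jump-mask-not-zero" loop can be modelled in software. The sketch below is a behavioural illustration, not the hardware implementation: it assumes each gather step services every pending element that falls in one 64-byte cache line and clears the matching mask bits, and the function name, memory model and element size are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size


def gather(memory, base, indices, mask, elem_bytes=4):
    """Gather memory[base + index * elem_bytes] for every lane whose mask bit is set.

    Returns the gathered values and the number of gather steps taken.
    """
    pending = list(mask)
    result = [None] * len(indices)
    steps = 0
    while any(pending):  # jump-mask-not-zero
        lane = pending.index(True)
        line = (base + indices[lane] * elem_bytes) // CACHE_LINE_BYTES
        for i, index in enumerate(indices):  # one gather step
            addr = base + index * elem_bytes
            if pending[i] and addr // CACHE_LINE_BYTES == line:
                result[i] = memory[addr]
                pending[i] = False  # clear the mask bit for a completed lane
        steps += 1
    return result, steps


# Toy memory: one value per 4-byte element, keyed by byte address.
memory = {addr: addr * 10 for addr in range(0, 256, 4)}
values, steps = gather(memory, base=0, indices=[0, 1, 2, 3, 16, 17, 40, 41], mask=[True] * 8)
print(values, steps)
```

In this model the loop count depends on how many distinct cache lines the indices touch, which is why gathers with good locality finish in fewer iterations.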
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Visual and Parallel Computing Group 
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ADI’s Revolutionary BF60x Vision Focused 
Digital Signal Processor System On Chip: 
25 Billion Operations/Sec @ 80 mW and Zero 
Bandwidth 


Robert Bushey, Principal Architect & Technologist, 
Processor & Digital Signal Processing Core Products & Technologies Group, ADI 




Innovation Has Driven 40+ Years of Real- 
World Signal Processing Leadership 


$2.6B 


2008 


Source: ADI revenue history from ADI financial data. Years 2002-2006 represent continuing operations.


Two New Groups 
Collaborating to Address Customer Needs in Market-specific Ways 


Core Products & Strategic Market 
Technologies Group FIRE Group 


MEMS/ i. 
Converters Sensors а, - Automotive 
| Expertise to 
4 -n 
ЛОС Communications 
Z. A Infrastructure 
Processor-DSP Power 
j | Consumer 
` Healthcare 
* Industrial & 
Instrumentation 
É > 


BF60Q 
Dual| Core Blackfin 
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DEVICES Docs 
[Roadmap figure: Blackfin processor family]
BF609: HD PVP, 256KB L2
BF608: VGA PVP, 256KB L2
BF607: No PVP, 256KB L2
BF606: No PVP, 128KB L2
BF54x, 53x
BF52x, 51x
Low Cost Blackfin
Future
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Advanced Driver eines Systems(ADAS) 


" | кан. rakini A venien 
> ое & Market Driven DSP Басбет е 
✓ Real Time @ 30 FPS at 1280x960 Pixels/Frame Performance
> 37 Megapixels / Second Real Time ADAS Analytics
> Many Parallel and Serial Concurrent Operations / Pixel
> BILLIONS of Operations / Sec or GOPS
✓ Low Power, Low Cost, and Low Bandwidth Constraints
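
The pixel rate above is simple arithmetic on the frame size and frame rate. A minimal sketch; the resolution and frame rate are from the slide, the 25 billion operations per second figure is from the talk title, and applying that figure at exactly this resolution is an assumption made here for illustration:

```python
def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Pixel throughput required for real-time processing."""
    return width * height * fps


rate = pixels_per_second(1280, 960, 30)
ops_per_pixel = 25e9 / rate  # operations available per pixel at 25 GOPS

print(f"{rate / 1e6:.1f} megapixels/s, about {ops_per_pixel:.0f} operations per pixel")
```

This reproduces the roughly 37 megapixels per second on the slide and shows why per-pixel budgets of hundreds of operations lead to totals in the billions of operations per second.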


"Thr a 


ADSP-BF609 Blackfin Highlights (1)


* New Pipelined function-level Vision Processor (PVP) for embedded vision applications
  + Supports multiple concurrent analytics functions at low price with low power consumption
  + With our new dedicated function level vision processor, broad adoption of sophisticated, multi-function analytics can now be feasibly deployed into all levels of embedded vision applications


* Highest performance Blackfin Instruction-level processing
  + 1GHz of programmable Blackfin instruction level processor performance across two cores
  + Large on-chip memory: 4.3Mbit SRAM & highly efficient system bandwidth
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ADSP-BF609 Blac 


* Feature rich peripheral set & connectivity options
  + Memory interfaces: DDR2, LPDDR, RSI (Removable Storage Interface for MMC, SD, SDIO, and CE-ATA)
  + Connectivity: USB2.0, Ethernet, 5 types of serial interfaces, ePPI Video Interface for seamless CMOS sensors and LCD connectivity and control
  + Link ports for high speed multiprocessing and inter-chip communication


* Integration for safety oriented applications
  + Parity, ECC, system protection unit for detecting/recovering from faults


* Delivering lowest power per function
  + Typical power consumption at 25C for the BF609 is 400mW
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12 SRAM 
with ECC, 256KB
(BF606: with parity, 128KB)

SYSTEM CROSSBAR 
AND DMA SUBSYSTEM 


HARDWARE PROCESSING 
BF608/BF609 Only 
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= 
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1x USB OTG MP Æ 
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Video Subsystem 



BF609 is Optimized for Many-way
Multi-processing With Efficient Inter-chip
Communication and Control


EPPI EPPI EPPI


Camera Camera Camera


* or unidirectional 16-bit PPI 
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BF609's Masters, Slaves, And Interconnect 


Peripherals 


© 
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MMR Busses 
DMA Busses 


f ШТА бш "дарр, 


ВҒ609 Маео one & я 


ADSP-BF60x


Pixel Crossbar 


* BF60x introduces a new Video Subsystem (VSS) architecture and interconnect:
  * 3 Enhanced Parallel Peripheral Interfaces (EPPI);
  * Pipelined Vision Processor (PVP);
  * Pixel Compositor (PIXC);
  * Pixel Crossbar
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а Data is processed / analyzed before it goes to memory 
* Traffic does not load system bandwidth / power / EMI
* Raw PPI data can go to memories in parallel
* Data broadcast to DMA, PIXC and PVP
* Multiple data paths distribute to L1 / L2 / L3 memories


ШУТ f fer ° 7, № ney 


Pir À | ІП n ICU \ V it sS IO п А Е 3 ro e »C re g e iS S4 a ) d E ] À Vi») 


* The PVP compiles more than 25 billion
operations per second of vision processing
while consuming little power and utilizing
limited memory bandwidth


robotic and Machine Vision Systems, as
well as other adjacent vision/imaging


+ PVP provides application performance
across the major areas:
  * Object Detection
  * Object Classification and Tracking
  * Object Verification
* PVP works in conjunction with the high
performance instruction level programmable
Blackfin DSP cores


+ PVP reduces required off-chip bandwidth by
windowing and pre-filtering input data


Example 
Canny or Sobel Edge 
Detection 
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Path Flexibility & Function Level Processing Capabilities 

+ Optimal bandwidth reduced pixel datapaths 

* Function level processing with highly configurable datapath 

+ Enables many computationally complex vision applications 

+ Allows for concurrent support of multiple applications 

e роза Image size (frame rate 30 fps): 1280х960, 1024x768, 
x 


* Supported Pixel-width: up to 16bits 


+ PVP Supports Vision Function Level Processing: Sobel filter &
Canny filter (Convolution), Histogram, ARCTAN and Absolute
Value (Angle and amplitude vectors), Image integration, Pixel


[Figure: PVP datapath with PIXC input, 3x3 and 5x5 Convolution blocks, and Edge Classification]
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Pipelined Vision Processor (PVP) 
Key Function Level Processing Blocks (1) 


+ 2D Convolution Blocks 
e Supports 1*1, 3*3, 5*5 configurations up to 16-bit input, 16-bit 
coefficient (updateable line by line) 
e Internal 37-bit Acc & Barrel shift 
e Scaled to 32-bit result 


e PVP Initialization via zero filled lines or duplication of the first/last 
line per frame 


+ ALU/Cartesian to Polar Block 
e Input two data-streams at 32-bit 
e Output two 16-bit streams or one 32-bit stream 
e Math operations supported (signed/unsigned) 
+ ADD, SUB, 32-bit multiply, 32-bit divide, Accumulation (xx bit) 


e Shift (logic, arithmetic), XOR, Masking, Inversion, Arctan, Absolute
value (x²+y²)
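
The two blocks above map onto a familiar edge-detection front end: 3x3 filters produce horizontal and vertical gradients, and the Cartesian-to-polar step turns them into a magnitude term (x²+y²) and an angle (arctan). The following is a software illustration of that dataflow, not a model of the PVP hardware; the function names are hypothetical, the kernels are the standard Sobel kernels, and the border is zero-filled, one of the two initialization options the slide mentions.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(image, kernel):
    """Apply a 3x3 kernel (as a correlation) with zero-filled borders."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[ky][kx] * image[yy][xx]
            out[y][x] = acc
    return out


def to_polar(gx, gy):
    """Return (x^2 + y^2, angle in radians) for one gradient pair."""
    return gx * gx + gy * gy, math.atan2(gy, gx)


# A vertical step edge: dark on the left, bright on the right.
image = [[0, 0, 10, 10] for _ in range(4)]
grad_x = filter3x3(image, SOBEL_X)
grad_y = filter3x3(image, SOBEL_Y)
magnitude_sq, angle = to_polar(grad_x[1][1], grad_y[1][1])
print(grad_x[1][1], grad_y[1][1], magnitude_sq, angle)
```

For the interior pixel next to the edge, the horizontal gradient is large and the vertical gradient is zero, so the angle points along the x axis.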
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Pipelined Vision Processor (PVP) 
Key Function Level Processing Blocks (2) 


* Edge Classification/Packing Block 


e Covers edge enhancement performing non-linear filtering in a pixel 
neighborhood, edge classification based on orientation, sub-pixel 
position interpolation 


e Packs the class, vertical/horizontal sub-pixel position into one byte 
per pixel 


* Threshold/Integral Image Block 


e 16 x 32-bit threshold function (output => 4bits classification, НІС, 
rounding up to nearest threshold, finds max. value) 


e Rudimentary histogram function (16 x 32-bit histogram counter, 
starts relative to the start of frame/line) 
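
A threshold block with 16 levels and a 16-counter histogram can be sketched as follows. This is an interpretation for illustration only: it assumes sorted thresholds, assumes "rounding up to nearest threshold" means selecting the first threshold that is greater than or equal to the input, and clamps larger inputs to the last class. The function names and the threshold values are hypothetical.

```python
from bisect import bisect_left


def classify(value, thresholds):
    """Return a 4-bit class: index of the first threshold >= value, clamped to the last."""
    return min(bisect_left(thresholds, value), len(thresholds) - 1)


def histogram(values, thresholds):
    """Count how many values fall into each threshold class."""
    counts = [0] * len(thresholds)
    for value in values:
        counts[classify(value, thresholds)] += 1
    return counts


# Hypothetical, evenly spaced thresholds: 16, 32, ..., 256.
thresholds = [16 * (i + 1) for i in range(16)]
counts = histogram([0, 16, 17, 255, 1000], thresholds)
print(counts)
```

With 16 thresholds the class index fits in 4 bits, matching the "4bits classification" output noted on the slide.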
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ШТ , Г, Mang 


dA T 


ADAS Use Case: HD BUE + 


Camera: 1280x960x30, monochrome
Pipe: RLE with 16-bit reports, max. 32 reports per line


50MP/s 0.5MW/s 


One byte per pixel 
1st Derivative 


DMA (packing) 


50MB/s = 12.5MW/s 


L3 0 MB/s 
L2 52 MB/s 
L1 0 MB/s 
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8bit 
720x480@60fps 


ADV7842 
Video 
Decoder 


Canny Edge Classification 
Single Blackfin Core 


Dot Count 
Filter 


ADV7341
Video
Encoder
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High Perrorimance, Parallelism 
Lower Frequency & Low Power 


* TSMC 65nm GP High Performance Process
  + 25 MAC ~ 50 KGates
  + 5 MAC ~ 25 KGates


* TSMC 65nm LP Low Power Process
  + 25 MAC ~ 55 KGates


[Table: Architecture, Number of MACs, Clock Speed (MHz), Leakage (mW), Dynamic (mW), Total (mW)]


> 5 GMACs @ 55mW and ZERO incremental BW due to extensive
pipelining at multiple levels of the architecture and optimized
function level processing


+ 11 mW per GMAC
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Video/Image Analysis: Software Architecture 


Blackfin Image Processing Toolbox is a collection of hundreds of optimized
functions for image analysis & manipulation. A few examples are
histogram operations, morphological operations, 2D convolutions.


Video Analytics Toolbox is a set of high level functions that are focused on
solving Intelligent Video Surveillance applications. Current release supports
foreground Object/Blob detection. Uses Image Processing Toolbox functions.


[Diagram labels: Video Analytics, Automotive Analytics, Wrapper for PVP, PVP Hardware]
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Application Software 


Image Processing, Automotive &
Industrial Analytics Toolbox Functionality Available
Today (Includes Hardware Mapping as Appropriate)


+ Color Conversion 

+ Image Statistical Tools 

+ ADAS Modules 

+ Object & Feature Recognition 


+ Image Filtering 


+ Shape-structure Analysis & Computational Geometry


> Geometric Transformations 


» Camera Calibration 
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Tradeoffs, Take-aways, & Conclusions (1) 


* An appropriate architectural solution can best be derived from a detailed understanding of market and technical requirements acquired through close customer collaboration and extensive end product technical domain knowledge
  - The pipelined vision processor was architected and defined through customer collaboration coupled with general vision and imaging technical hardware and software domain expertise

* Hardware/Software/IP partitioning is very important and ultimately determines solution power, performance, and cost
  - Choosing to perform appropriate required functions in software on one or more symmetrical or asymmetrical instruction level processors provides many advantages including flexibility
  - Partitioning highly computationally complex imaging or vision processing into the appropriate hardware functional IP blocks will generally lead to a cost optimized, low power, and reduced memory bandwidth solution
  - Optimizing pixel datapaths and flow in an imaging or vision focused SOC is very important when defining a low cost and power solution
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Tradeoffs, Take-aways, & Conclusions (2) 


* Many system-on-chip architectures will continue to require functionality and IP driven by multiple markets and many applications
  - The BF60x SOC was architected and defined to meet the requirements across multiple markets (e.g. automotive ADAS and industrial vision) and across many applications (e.g. lane departure warning, traffic sign recognition, and barcode reading)

* Trade-offs involving instruction level processors, function level processors, dedicated internally developed IP, and 3rd party IP must be weighed carefully in order to arrive at the most optimal SOC architecture and general definition
  - The BF609 contains instruction level digital signal processors, a function level processor which is comprised partly of dedicated internally developed vision focused IP, and 3rd party IP
  - The selection, partitioning, and definition of these SOC components is vital to meeting challenging customer and industry competitive requirements

* Architecting and defining a highly efficient crossbar interconnect and DDR memory controller IP is critical to ensure that the system meets all of the bandwidth and latency requirements across many demanding masters
  - Arbitration and prioritization optimization throughout the entire data path from master to slave is paramount to satisfying all master requirements when executing highly computationally complex vision applications
The World Leader in High Performance Signal Processing Solutions

* Email
  - Robert.Bushey@analog.com
* www.analog.com/BlackfinModules
  - Vision Analytics Toolbox (VAT)
  - Image Processing Toolbox (IPTBX)
  - ADAS Vision Analytics Toolbox (AVAT)
  - 2D Graphics Libraries (BF2DGL)
* www.analog.com/Blackfin
  - Blackfin Processors & SOCs
* automotive.analog.com
  - Automotive and ADAS
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Visconti2 - A Heterogeneous 
Multi-Core SoC for Image- 
Recognition Applications 


Masato Uchiyama, Hideho Arakida, Yasuki Tanabe, 
Tsukasa Ike, Takanori Tamai, Moriyasu Banno 


Toshiba Corporation, Kawasaki, Japan 


Copyright 2012, Toshiba Corporation. 


Outline 


* Background 
* Visconti2 
— Overview of architecture and chip 


— CoHOG accelerator 
(Co-occurrence Histograms of Oriented Gradients) 


* Real Applications 
— Monocular Pedestrian Detection 
— Hand Gesture User Interface (UI) 
* Conclusion 
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Background: Targets of Visconti2 


[Figure: image recognition technology applied to a variety of products. Labels: backover, forward collision warning, door security, face tracking for glassless 3D, traffic sign recognition, lane change assistance]

Visconti2 designed for
- Automotive : Advanced Driver Assistance Systems (ADAS)
- Consumer
- Industry
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Background: Requirements & Approach 


* High accuracy of recognition
  [Figure: pedestrian or non-pedestrian classification]
  CoHOG (Co-occurrence Histograms of Oriented Gradients)
  - One of the most accurate image feature descriptors
  - Toshiba original (T.Watanabe et al., Proc. PSIVT 2008, pp.37-47)

* High performance
  - E.g. Monocular Pedestrian Detection using CoHOG
    > 3,983ms/frame on 1GHz CPU: 40x speedup required for real-time
* Low power consumption
  - Cooling without fan (< 1W in typical condition)

Hardware accelerators for frequently used tasks which are performance bottlenecks (CoHOG, etc.)
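The "40x speedup required" figure follows from the measured CPU time and the real-time frame budget. A minimal sketch of that arithmetic, assuming the 10fps target quoted on the later execution-time slide; the function name is illustrative only:

```python
# Figures quoted on the slides.
CPU_MS_PER_FRAME = 3983.0  # monocular pedestrian detection on a 1GHz CPU
TARGET_FPS = 10.0          # assumed real-time target (10fps = 100 msec/frame)


def required_speedup(measured_ms: float, target_fps: float) -> float:
    """Return how many times faster processing must be to fit one frame period."""
    budget_ms = 1000.0 / target_fps
    return measured_ms / budget_ms


speedup = required_speedup(CPU_MS_PER_FRAME, TARGET_FPS)
print(f"frame budget: {1000.0 / TARGET_FPS:.1f} ms, required speedup: {speedup:.1f}x")
```

The result is just under 40x, which matches the rounded figure on the slide.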
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Outline 


* Visconti2
— Overview of architecture and chip 


— CoHOG accelerator 
(Co-occurrence Histograms of Oriented Gradients) 
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Chip Architecture 


[Block diagram: units #1 to #4 above a crossbar switch (128b x 133MHz); accelerators including Filter, Matching, and CoHOG; a second crossbar switch (128b x 133MHz); DDR2 I/F, RISC, Video I/Fs In(4ch)/Out(1ch), CAN I/F, Misc I/F]

Memory Bandwidth
  DDR2: Peak 2GB/sec
  On-chip RAMs: 2GB/sec x 4ch.
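The crossbar figure (128b x 133MHz) can be converted into a peak byte rate for comparison with the memory bandwidth numbers. This is a rough sketch that assumes one 128-bit transfer per clock on a single crossbar port, which the slide does not state explicitly:

```python
def peak_bytes_per_sec(width_bits: int, clock_hz: float) -> float:
    """Peak transfer rate of a bus moving width_bits every clock cycle."""
    return width_bits / 8 * clock_hz


# 128b x 133MHz crossbar port (assumption: one transfer per clock).
crossbar_gb = peak_bytes_per_sec(128, 133e6) / 1e9

# On-chip RAMs: 2GB/sec x 4ch., as quoted on the slide.
onchip_total_gb = 2.0 * 4

print(f"crossbar port peak: {crossbar_gb:.3f} GB/s")
print(f"on-chip RAM aggregate: {onchip_total_gb:.1f} GB/s")
```

Under that assumption a single port peaks at about 2.1 GB/s, in line with the 2GB/sec per-channel figures.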


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Multi-core Subsystem 


* Four homogeneous VLIW cores with 256KB L2$
— 3-way VLIW core 
* RISC core + 2-way SIMD coprocessor (ISSCC '08[S.Nomura]) 
* Additional 64KB data RAM and DMA controller 
— Exploit multi-grain parallelism 
* Application, task and thread level parallelism: by four cores 
* Data level parallelism: by SIMD coprocessor 


[Figure: 4 cores, each a processor with I$ and D$, a 64KB data RAM, and a 2-way SIMD coprocessor, sharing a 256KB L2$]
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Hardware Accelerators 
* Six accelerators implemented
  - CoHOG accelerator
  - Matching accelerator
  - Histogram accelerator
  - Affine accelerator
  - Two Filter accelerators

Realizing "High performance with low power consumption"
  > We adopted "Highly parallelized" approach rather than "High clock frequency" approach.
CoHOG based Recognition

Extension to widely-used HOG (Histogram of Oriented Gradients)
1. Make gradient orientation image

[Figure: Region of Interest (ROI); 8 gradient orientations; 31 co-occurrence patterns]

Higher accuracy
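The idea on this slide can be sketched in a few lines: quantize the gradient direction at each pixel into 8 orientations, then count how often each pair of orientations co-occurs at a set of pixel offsets. The sketch below is an illustration, not the accelerator's algorithm. The offset set (a half-disc with squared radius 18) is a hypothetical choice made only so that 31 patterns result, and the sub-region tiling used by the published descriptor is omitted.

```python
import math

NUM_ORIENTATIONS = 8  # the slide quotes 8 gradient orientations


def orientation_image(gray):
    """Step 1: quantize the gradient direction at every pixel into 8 bins."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences, clamped at the image border.
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            angle = math.atan2(gy, gx) % (2 * math.pi)
            out[y][x] = int(angle / (2 * math.pi) * NUM_ORIENTATIONS) % NUM_ORIENTATIONS
    return out


def make_offsets(radius_sq=18):
    """Hypothetical offset set: zero offset plus a half-disc, 31 offsets in total."""
    offsets = [(0, 0)]
    r = math.isqrt(radius_sq)
    for dy in range(0, r + 1):
        for dx in range(-r, r + 1):
            inside = dx * dx + dy * dy <= radius_sq
            if inside and (dy > 0 or (dy == 0 and dx > 0)):
                offsets.append((dx, dy))
    return offsets


def cohog(orient, offsets):
    """One 8x8 co-occurrence matrix of orientations per offset."""
    h, w = len(orient), len(orient[0])
    n = NUM_ORIENTATIONS
    hist = [[[0] * n for _ in range(n)] for _ in offsets]
    for y in range(h):
        for x in range(w):
            a = orient[y][x]
            for k, (dx, dy) in enumerate(offsets):
                xx, yy = x + dx, y + dy
                if 0 <= xx < w and 0 <= yy < h:  # pixel range check
                    hist[k][a][orient[yy][xx]] += 1
    return hist


# 18 x 36 ROI containing a horizontal intensity ramp.
roi = [[x for x in range(18)] for _ in range(36)]
offsets = make_offsets()
hist = cohog(orientation_image(roi), offsets)
feature_length = len(offsets) * NUM_ORIENTATIONS ** 2
print(len(offsets), feature_length)
```

Each pixel touches every offset once, which is why a hardware implementation that evaluates all offsets in parallel can sustain one pixel per clock.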
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CoHOG Accelerator 
* Throughput: 1 pixel / clock @266MHz
  - 31 co-occurrence pairs are calculated in a clock cycle.
    * 31 x 3 arithmetic operations
    * 31 x 2 data references
    * Pixel range check
* Over 400,000 ROIs/sec (18 x 36 pixels/ROI)

[Figure: an 18 x 36 pixel ROI scanned row by row, producing a likelihood]

400,000 ROIs/sec is enough for our target applications.
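The "over 400,000 ROIs/sec" figure is consistent with the quoted throughput and ROI size. A short check, assuming each ROI costs exactly one pass over its pixels with no per-ROI overhead (an idealization, not a statement from the slide):

```python
CLOCK_HZ = 266e6          # 1 pixel / clock @266MHz
ROI_WIDTH, ROI_HEIGHT = 18, 36

pixels_per_roi = ROI_WIDTH * ROI_HEIGHT
rois_per_sec = CLOCK_HZ / pixels_per_roi

print(f"{pixels_per_roi} pixels/ROI -> {rois_per_sec:,.0f} ROIs/sec")
```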
Features and Chip

[Die photo with labels including Video In I/F, Video Out I/F, CoHOG, Affine, Filter, Accelerator, and Multi-core Processor]

Total peak performance: 464GOPS
Power efficiency: 620GOPS/W

(Y.Tanabe et al., Proc. ISSCC 2012, pp.222-223)
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Outline 


* Real Applications 
— Monocular Pedestrian Detection 
— Hand Gesture User Interface (UI)
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Real Applications 


* Monocular Pedestrian Detection
  - ... than using stereo camera.
  - Huge computations are required. (Sliding window CoHOG recognition is used instead of depth estimation based on stereo matching with stereo camera.)

* Hand Gesture UI
  - Hand recognition is applied to many ROIs (sliding window CoHOG recognition).
  - High frame rate is required.

[Figure: command examples: select, cancel]
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Pedestrian Detection : Processing Flow 


[Flow diagram. Stages shown: camera input image; recognize using CoHOG (CoHOG accelerator); cluster & merge; track pedestrians from frame to frame (Matching accelerator); calculate distance; alert and/or braking]
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Pedestrian Detection : CoHOG Recognition 


[Figure: a template matched against scaled-down images, with miss and hit examples]

* A number of scaled images are generated by Affine accelerator.
  - A template is used to match with the scaled images:
    * To detect pedestrians in different distances
    * To detect pedestrians with different body height

* Sliding window CoHOG recognition
  > 650 ROIs / image @ VGA

* Performance requirement of CoHOG
    500 (sliding window ROIs on average)
  x 20 (scaled images)
  x 10 (frame / sec)
  = 100,000 ROIs/sec
  < CoHOG accelerator : 400,000 ROIs/sec
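The requirement above is a straight product of three factors. A small sketch that reproduces it and reports the headroom against the quoted accelerator throughput; the variable names are illustrative:

```python
ROIS_PER_SCALED_IMAGE = 500   # sliding window ROIs on average
SCALED_IMAGES = 20
FRAMES_PER_SEC = 10
ACCELERATOR_ROIS_PER_SEC = 400_000  # quoted CoHOG accelerator throughput

required = ROIS_PER_SCALED_IMAGE * SCALED_IMAGES * FRAMES_PER_SEC
headroom = ACCELERATOR_ROIS_PER_SEC / required

print(f"required: {required:,} ROIs/sec, headroom: {headroom:.1f}x")
```

The accelerator has roughly four times the throughput that pedestrian detection needs at 10 frames per second.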
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Pedestrian Detection : Execution Time 


* Execution time per frame
  10fps = 100 msec/frame

[Bar chart: execution time per frame for a 1GHz CPU (3983 msec) versus Visconti2. Stages: make scaled images, make gradient images, recognize using CoHOG, track & merge. Visconti2 execution breakdown legend: Affine, Filter, Matching, CoHOG, Multi-core]
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Hand Gesture UI : Processing Flow 


* Switching between two processing modes
  - Detection mode : sliding window hand recognition @ 15fps
  - Tracking mode : trajectory recognition @ 30fps

[Flow diagram. Stages shown: camera input image; pre-processing into scaled images and moving area (Filter accelerator); sliding window hand recognition, recognizing palm and fist (accelerator); multi-core processor; command]
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Hand Gesture UI : Execution Time 
* Execution time per frame in detection mode
  15fps = 66.7msec/frame

[Bar chart: 1GHz CPU (540.9 msec) versus Visconti2, split into pre-processing, detection part, and post-processing. Execution breakdown legend: Affine, Filter, CoHOG, Multi-core]

* Execution time per frame in tracking mode
  30fps = 33.3msec/frame

Real-time Execution
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Evaluation of Power Consumption 


Typical condition: Process center sample, 25°C

* Monocular Pedestrian Detection
  - Chip total : 870mW
  - Core (1.1V) : 356mW
  - PHY (1.8V) : 460mW
  - I/O (3.3V) : 54mW

* Hand Gesture UI
  - Chip total : 891mW
  - Core (1.1V) : 363mW
  - PHY (1.8V) : 472mW
  - I/O (3.3V) : 56mW
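The per-rail figures can be cross-checked against the chip totals. The sketch below sums the three rails for each application and compares the result with the fan-less cooling budget of 1W; the dictionary layout is illustrative only:

```python
# Measured power per supply rail in mW, as quoted on the slide.
power_mw = {
    "Monocular Pedestrian Detection": {"core": 356, "phy": 460, "io": 54},
    "Hand Gesture UI": {"core": 363, "phy": 472, "io": 56},
}
FANLESS_BUDGET_MW = 1000

totals = {app: sum(rails.values()) for app, rails in power_mw.items()}
for app, total in totals.items():
    phy_share = power_mw[app]["phy"] / total
    print(f"{app}: {total} mW total, PHY share {phy_share:.0%}, "
          f"under budget: {total < FANLESS_BUDGET_MW}")
```

Both sums reproduce the quoted chip totals, and in both applications the PHY rail accounts for slightly more than half of the power.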


< 1W : Cooling without fan

[Photo: power measurement environment]


Conclusion 


* Visconti2 is a heterogeneous multi-core SoC dedicated for image recognition.

* Visconti2 achieves:
  - Accurate recognition
    * CoHOG based image recognition is implemented.
  - High performance with low power consumption
    * We implemented six highly parallelized hardware accelerators.
    * Under 1W power consumption is achieved. (typical condition)

* Two real applications on Visconti2 using HW accelerators are demonstrated.
  - Monocular Pedestrian Detection
  - Hand Gesture User Interface

* Visconti2 status: ES ready


http://www.semicon.toshiba.co.jp/eng/product/assp/selection/automotive/infotain/visconti/ 
Centip3De: A 64-Core, 3D Stacked,
Near-Threshold System 


Ronald G. Dreslinski 


David Fick, Bharan Giridhar, 

Gyouho Kim, Sangwon Seo, Matthew Fojtik, 
Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim, 
Nurrachman Liu, Michael Wieckowski, Gregory Chen, 
Trevor Mudge, Dennis Sylvester, David Blaauw 
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The Problem of Power 


[Figure: scaling trends plotted against technology node (nm)]

Power does not decrease at the same rate that transistor count increases, resulting in increased energy density.

Circuit supply voltages are no longer scaling...

Dynamic dominates: $W/A \approx C V_{dd}^2 f / A$, with A = gate area (scaling $1/s^2$) and C = capacitance (scaling $< 1/s$)

The emerging dilemma:
More and more gates can fit on a die, but cooling constraints are restricting their use

University of Michigan


Today: Super-Vth, High Performance, Power Constrained

[Figure: energy/operation versus supply voltage with the super-Vth region marked, and normalized CPU metrics (power, energy/op, performance)]

* Large gate overdrive favors performance with unsustainable power density
* Must design within fixed TDP
* Goal: maintain performance, improved Energy/Operation


Subthreshold Design

[Figure: energy/operation versus supply voltage, from super-Vth down to subthreshold, with normalized power, energy/op, and performance bars. Values shown: power >5000x, energy/op ~12-16x, performance >500x]
Near-Threshold Computing (NTC)

= >60X power reduction

Architectural Impact of NTC

= Caches have higher Vopt and operating frequency
= Smaller activity rate when compared to core logic
= Leakage larger proportion of total power in caches
= New Architectures Possible
Proposed NTC Architecture

= SRAM is run at a higher VDD
= Caches operate faster than core
= Can introduce clustered architecture
  = Multiple cores share L1
  = Cores see private L1
  = L1 still provides single-cycle latency
= Advantages:
  = Less coherence/snoop traffic
  = Larger cache for processes that need it
= Drawbacks:
  = Core conflicts evicting L1 data
    = Not dominant in simulation
  = Longer interconnect
    = 3D addressable
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Proposed Boosting Approach 


Measured results for 130nm LP design
  10MHz becomes ~110MHz in 32nm simulation
  140 FO4 delay core

Baseline
= Cache runs 4x core frequency
= Pipelined cache

Better Single Thread Performance
= Turn some cores off, speed up the rest
= Cache de-pipelined
  = Faster response time, same throughput
= Core sees larger cache
  = Faster cores need larger caches

Cluster configurations shown:
  4 Cores @ 10MHz (650mV), Cache @ 40MHz (800mV)
  1 Core @ 40MHz (850mV), Cache @ 80MHz (1.15V)
  1 Core @ 80MHz (1.15V), Cache @ 160MHz (1.65V)
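To get a feel for the cost of boosting, the operating points above can be compared with the usual first-order dynamic power relation, power proportional to C V^2 f. The sketch below is only that approximation: it holds switched capacitance constant, ignores leakage (which is significant near threshold), and ignores cache power, so it is not expected to reproduce the measured results.

```python
def relative_dynamic_power(voltage_v: float, freq_mhz: float) -> float:
    """First-order dynamic power in arbitrary units (C held constant)."""
    return voltage_v ** 2 * freq_mhz


baseline_core = relative_dynamic_power(0.65, 10)   # one core at 10MHz, 650mV
boosted_core = relative_dynamic_power(1.15, 80)    # one core at 80MHz, 1.15V

per_core_ratio = boosted_core / baseline_core
# Compare one boosted core with a full baseline cluster of four cores.
cluster_ratio = boosted_core / (4 * baseline_core)

print(f"boosted core vs baseline core: {per_core_ratio:.1f}x")
print(f"boosted core vs 4-core baseline cluster: {cluster_ratio:.1f}x")
```

Under these assumptions an 8x frequency boost costs about 25x the dynamic power of a single baseline core, which illustrates why boosting one cluster means gating or slowing others within a fixed TDP.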
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Cache Timing 


NTC Mode (3/4 Cores)
= Low power
= Tag arrays read first
= 0-1 data arrays accessed
[Timing diagram: IF/DE stage, EX stage, cache access, MEM stage. The cache access is split into tag read, tag compare, data read, and memory access on four successive clock edges]

Boost Mode (1/2)
= Low latency
= Data and tags read in parallel
= 4 data arrays accessed
[Timing diagram: IF/DE stage, EX stage, cache access, MEM stage. The cache access is tag & data read followed by tag compare & memory access on two clock edges]
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Centip3De System Overview 


= 7-Layer NTC system
= 2-Layer system completed fabrication with measured results
= Full 7-layer system expected end of 2012
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Centip3De System Overview 


= Cluster architecture
  = 4 Cores/cluster
  = 1kB I$, 8kB D$
  = Local clock controller operates cores 90° out-of-phase
  = 1591 F2F connections
= Organized into layer pairs (cache/core)
  = 16 clusters per pair
  = Cores have only vertical interconnections

[Stack diagram labels: cluster; cache bus hub with 8x8 crossbar; Tezzaron DRAM control layer; units disabled due to redundancy]
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Centip3De System Overview 


= Bus interconnect architecture
  = Up to 500 MHz
  = 9-11 cycle latency
  = One per DRAM cache
  = 8 lanes, each 128b
= Each cluster connects to all eight
  = 10240 total
= Vertically connected through all four layers
  = Flipping interface enables 128-core system

[Stack diagram labels: core layer, bottom; cache bus hub with 8x8 crossbar; DRAM control layer; DRAM bitcell layer]
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Centip3De System Overview 


= 3D-Stacked DRAM 


= Tezzaron Octopus 


= 1 control layer 
= 130nm CMOS 


= 1 Gb bitcell layers 
= Up to two layers 
= DRAM process 


= 8x 128b DDR2 
interfaces 
Centip3De System Overview

= 130nm process
= 12.66x5mm per layer
= 28.4M device core layer
= 18.0M device cache layer
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2-Layer Stacking Process Evaluated 


[Cross-section: wirebonds, F2F bond, cache layer. Aluminum wirebonding pads are connected to perimeter TSVs as in the 7-layer system]

For the measured 2-layer system, aluminum wirebond pads were used instead
[Figure: cache 3D connections, core 3D connections, and cluster 3D connections within the sea of gates]

1591 F2F Connections
= Each saved ~600-1000um in routing
= Prevented wiring congestion around SRAMs

Die Shot

= Looking through back of core-layer
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System Configurations 


Mode         Cache bus hub      Cache              Cores              Boosted   Gated
4 Core Mode  160 MHz, 1.15 V    40 MHz, 0.80 V     10 MHz, 0.65 V     0         0
3 Core Mode  160 MHz, 1.15 V    80 MHz, 1.15 V     20 MHz, 0.75 V     3         1
2 Core Mode  160 MHz, 1.15 V    80 MHz, 1.15 V     0.85 V             2         2
1 Core Mode  320 MHz, 1.6 V     160 MHz, 1.65 V    1.15 V             1         3

Clock dividers shown: 4 Core Mode uses Div 4x (hub to cache) and Div 4x (cache to cores); 3 Core Mode uses Div 2x and Div 4x.
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Measured Results 


Boosting a single cluster to 1-core mode requires disabling, or down-boosting, other clusters

1-core cluster:
= 15x 4-core clusters
= 6x 3-core clusters
= 4.5x 2-core clusters

Baseline configuration depends on TDP and processing needs

[Bar chart: power (mW) by system configuration (4-Core, 3-Core, 2-Core, 1-Core), split into core power, cache power, and memory system power]
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Single-Threaded Performance (DMIPS) 


Measured Results

[Bar chart: DMIPS by system configuration (4-Core, 3-Core, 2-Core, 1-Core)]
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Measured Results 


[Bar chart: efficiency (DMIPS/Watt) by system configuration (4-Core, 3-Core, 2-Core, 1-Core)]

Measured Results: Centip3De - 3,930 (130nm)
Industry Comparison: ARM A9 - 8,000 (40nm) [1]
Estimated Results: Centip3De - 18,500 (45nm)

[1] http://arm.com/products/processors/cortex-a/cortex-a9.php, ARM Ltd, 2011.
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Conclusion 
= Near threshold computing (NTC)
  = Need low power solutions to maintain TDP
  = Achieves 10x energy efficiency => 10x more computation for a given TDP
  = Offers optimum balance between performance and energy
  = Allows boosting for single threaded performance (Amdahl's law)

= Large scale 3D CMP demonstrated
  = 64 cores currently
  = 128 cores + DRAM in the future
  = 3D design shown to be feasible

= This work was funded and organized with the help of DARPA, Tezzaron, ARM, and the National Science Foundation


XILINX ALL PROGRAMMABLE


FPGAs with 28Gb/s Transceivers Built with 
Heterogeneous Stacked-Silicon Interconnects 


Ephrem Wu and Suresh Ramalingam 


© Copyright 2012 Xilinx 


Outline

1. Key Application
2. Heterogeneous Stacked-Silicon FPGA Family
3. Stacked-Silicon Packaging
4. Two Types of Stacked-Silicon Interconnects


400Gb/s Line Card Application

[Block diagram: 4 x 100G optical interface built from four CFP2/CFP4 optical modules; Virtex-7 HT with up to 16 x 28 Gb/s GTZ transceivers and up to 72 x 13.1 Gb/s GTH transceivers; network processor; packet queues and lookup memory (SRAM, TCAM, DRAM); fabric interface; switch fabric]


[Figure: the heterogeneous stacked-silicon FPGA family (XC7VH290T, XC7VH580T, XC7VH870T), combining FPGA dice that carry GTH (13G) transceivers with GTZ (28G) transceiver ICs. Interface capacities shown: 2 x 100 Gb/s and 4 x 100 Gb/s]


XC7VH580T Under the Hood
Industry’s First Heterogeneous FPGA


[Cross-section labels: microbumps, C4 bumps, BGA balls]


Two Interconnect Types

Type I: Between IC and Package
Type II: Between two ICs

[Cross-section labels: GTZ-IC; microbumps; through-silicon via (TSV); passive silicon interposer; ceramic package substrate; BGA balls]
Packaging, Assembly, and Test

* Top Dice, Interposer, & Package: Xilinx
* 28nm FPGA & Interposer
* Package Substrate: Kyocera
* uBump, Die separation, Joining, & Assembly: Amkor Technology
* Final Test of Packaged Part: Xilinx


Type I Example: 28 Gb/s Serial Transmitter

[Figure: simulated results]

[Figure sequence: microbumps and microbump pads on the interposer. Dimensions not to scale, to show the interconnect cross section]

Wire Length Distribution
Between GTZ-IC and FPGA

[Histogram: wire length bins (um) from 500 to 10500 in steps of 500, plus "More", with the cumulative wirelength distribution (%) overlaid. 4138 nets total]


RC Static Timing Analysis for Productivity
Calibrated RC-Based STA Against RLC-Based SPICE

[Figure: Type II signals checked with the static timing analyzer]


Summary


1. Presented industry’s first heterogeneous 3D FPGA.
2. FPGA & GTZ-IC create three scalable products.
3. Reviewed stacked-silicon packaging & supply chain.
4. Showed Type I signaling: 28 Gb/s TX over TSVs.
5. Lacked 3D timing tools for Type II signals. Leveraged STA tools calibrated with RLC SPICE runs.


THE FUTURE OF WIRELESS NETWORKING
Marcus Weldon
CTO Alcatel-Lucent

COPYRIGHT © 2012 ALCATEL-LUCENT. ALL RIGHTS RESERVED.


WHAT IS REALLY DRIVING THE (WIRELESS) MARKET ?
WHERE IS THE REAL VALUE ?


WHAT IS REALLY DRIVING THE (WIRELESS) MARKET ?


THE TABLET GENERATION
IS IN COMMAND

* 67%: Would cut anything but Mobile BB (UK)
* 70%: Mobile-only Web users in emerging markets
* 100%: Broadband users on microblog (China)
* 500M: Users/month on Facebook apps platform
* 84%: Choose Internet over partner or car (Germany)
* 66%: Sleep with smart phone (USA)
* 11.5: Content hours in 7hrs by 8-18 years old (USA)
* 100M+: Tablets sold in 2012 globally

USERS KNOW WHAT THEY WANT AND HOW IT SHOULD BE DELIVERED


THE TABLET VERSUS...
EVERYTHING ELSE

[Chart: post- versus pre-tablet usage per device (don't use device, use device less, same usage, use device more). Devices: Connected TV, Smart Phone, GPS, Gaming Console, Portable Gaming Console, Portable Media Player, eBook, NetBook, Laptop PC, Desktop PC. The tablet enhances some of these devices and replaces others]

LIFE CONVERGENCE: WORK/HOME, CELL/WIFI, EVERYTHING EVERYWHERE

Source: The Nielsen Company
THE NET EFFECT:
THIS IS VERY DEMANDING

[Chart: total daily traffic by year (TB), 2011 to 2016, axis 0 to 800,000. Bell Labs modeling: 45x - 85x growth. Legend with CAGR 2011-2016 for WiFi Home, Metro Cell (WiFi), Metro Cell (3G/LTE), Smartload, and Macro; percentages shown: 24%, 54%, 30%, 13%, 34%]

MASSIVE GROWTH IN DEMAND REQUIRES NEW SUPPLY STRATEGY
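An overall growth factor can be restated as a compound annual growth rate. The sketch below converts the 45x - 85x range into yearly rates, assuming the factor spans the five yearly steps from 2011 to 2016 (an interpretation of the chart, not a figure from the slide):

```python
def cagr(growth_factor: float, years: int) -> float:
    """Compound annual growth rate implied by a total growth factor."""
    return growth_factor ** (1.0 / years) - 1.0


YEARS = 2016 - 2011  # five yearly steps (assumption)
low = cagr(45, YEARS)
high = cagr(85, YEARS)

print(f"45x over {YEARS} years -> {low:.0%} per year")
print(f"85x over {YEARS} years -> {high:.0%} per year")
```

Under that reading, total traffic would have to more than double every year across the whole period.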
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THE SUPPLY LEVERS: 
RADIO CAPACITY 


INCREASE CAPACITY
* MORE SPECTRUM (Hz)
* MORE SPATIAL EFFICIENCY (Bits/Sec/Hz/...): >10x
* MORE SPECTRAL EFFICIENCY (Bits/Sec/Hz): 1.5-2x

WHAT IS THE IMPACT FOR OPERATORS?
OPERATORS HAVE NOT CAPTURED
THE FULL VALUE

[Charts, 2007 to 2011. Left: operators' consumer revenues, ARPU by service (fixed data, fixed voice, mobile voice, mobile data) with CAGR 2007-2011. Right: operators' costs and profits as adjusted OPEX / revenue, adjusted OP / revenue, and CAPEX / revenue. Values shown: -1.3%, -4.4%, 14.1%, +1.0%]

+2.5% WW revenue: ARPU decline offset by more subs. High pressure on CAPEX to control OP

UNLOCK THE NETWORK VALUE TO MEET THE USER DEMAND

Source: Alcatel-Lucent analysis
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WHERE IS THE 


REAL VALUE ? 


AT THE SPEED OF IDEAS" 
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THE ESSENTIAL BRIDGE 
BETWEEN HAND AND CONTENT 


с & <> 


* Handheld: BEST EXPERIENCE
* Cloud Network: BEST DELIVERY
* Content in cloud: BEST MEDIA AND APPS

THREE DISTINCT FUNCTIONS, A SINGLE UNIVERSE
THE NETWORK IS AT THE EPICENTER
AND MATTERS MORE THAN EVER

[Diagram: the cloud network at the center of four activities (STORING, STREAMING, GAMING, COMMUNICATING), annotated with their demands: latency; total capacity, time to load; total capacity, bandwidth; total capacity, bandwidth, latency; bandwidth, latency; privacy and security]

THE NETWORK (AND THE OPERATOR) IS CRITICAL TO THE EXPERIENCE
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THE PROFOUND SHIFT DRIVES 
GLOBAL GROWTH 


“ANALOG” ECONOMY: VERTICALIZED SECTORS
* Disconnected
* Unit of mass markets
* Subscriptions, brand loyalty
* Hardware-Defined
* Proximity-based groups
* Independent economies
* Innovation timescale = years

“DIGITAL” ECONOMY: PARTIAL RE-CONSTRUCTION TO DIGITAL INDUSTRIES
* Connected
* Unit of family, friends, colleagues
* Digital cannibalization
* Software-Defined
* Rise of virtual social groups
* Interdependent markets
* Innovation timescale = months

“NEXT DIGITAL” ECONOMY: DEEP SOCIETAL CHANGE
* Hyper-connected
* Unit of one, highly empowered
* A-la-carte user experience
* Application-Driven
* Virtual global communities dominate
* Global market and economy
* Innovation timescale = days

USERS ARE MAKING THE MOVE, ARE WE READY?
THE FUTURE OF NETWORKS
A PLATFORM FOR EXPERIENCE INNOVATION

[Layer diagram, top to bottom: EXPERIENCE CREATION AND ENABLEMENT; DECISION ANALYTICS, OPTIMIZATION; CONTROL; ALL IP INFRASTRUCTURE]
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THE FUTURE ОҒ NETWORKS: WHERE'S THE MONEY? 


Magnitude of Upcoming Change Will be Stunning

Addressable Market For Re-Imagination
Aggregate Market Cap of Global Public Companies = $36+ Trillion*

Sector                        2011 Revenue ($B)   2011 EBITDA ($B)   Top Companies by Mkt Cap
Financials                    $4,647              $1,735             ICBC, China Construction Bank, Wells Fargo
Consumer Staples              3,972               543                Wal-Mart, Nestle, P&G, Coca-Cola
Information Technology        2,298               422                Apple, Microsoft, IBM, Google, Samsung
Energy                        662                 1,068              Exxon Mobil, PetroChina, Shell, Chevron
Consumer Discretionary        4,734               624                Toyota, Amazon.com, McDonald's, Walt Disney
Health Care                   2,204               455                Johnson & Johnson, Pfizer, Roche, Novartis
Industrials                   4,407               608                General Electric, Siemens, UPS
Materials                     2,607               712                BHP Billiton, Rio Tinto, Vale
Telecommunication Services    2,045               699                China Mobile, AT&T, Telefonica, Vodafone
Utilities                     1,150               315                GDF Suez, National Grid, E.ON, EDF
Total                         $35,066             $6,83…


Source: Mary Meeker, KPCB, Internet Trends 2012 




COPYRIGHT © 2012 ALCATEL-LUCENT. ALL RIGHTS RESERVED. 
ALCATEL-LUCENT — CONFIDENTIAL — SOLELY FOR AUTHORIZED PERSONS HAVING A NEED TO KNOW — PROPRIETARY — USE PURSUANT TO COMPANY INSTRUCTION 


THE CONSTANT DRIVE IN NETWORKING
MORE EFFICIENCY, LESS SPACE, LESS ENERGY


Compared to 100G systems


4X SPEED | 4X DENSITY | >2.6X CAPACITY | -33% POWER


IMS core utilization 
- 40% less 


VIRTUALIZATION CAPACITY 


400G PIC 


POWER 4x less 


ASIC WEIGHT 10x less 
POWER/Mbps -54% SPACE 7x less ЕЕ 
DENSITY x 
400G NP 
MIGRATION: 


COPPER TO FIBER 
POWER SAVING -42% 


TCO SAVING -51% ж 


POWER/Gbps -58% POWER/Gbps -75% ANYG SOC 


DENSITY 8 X DENSITY 10 X 
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THE FUTURE OF WIRELESS IS BIG & SMALL 
LTE MACRO ‘OVERLAY’ AND ‘UNDERLAY’ 


LEGACY (VOICE OPTIMIZED) to lightRADIO™ (DATA OPTIMIZED)

* Software-defined radio
* Wideband antennas
* Adaptive RF beams
* AnyG baseband
* Baseband pooling
* Virtualization of control
* Multiband antennas
* Vertical sectorization
* SoC


NEW 4G PLATFORMS DRIVE LOWER TCO FOR MACRO AND METRO 


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THE FUTURE OF WIRELESS IS BIG & SMALL
LTE METRO UNDERLAY EXAMPLE 


Delivering high capacity (100 Mbps down, 40 Mbps up) across central Barcelona 


MACRO (51 Sectors) METRO (11 Sectors) 


[Coverage maps. Labelled locations: Plaça Catalunya, Diagonal / Gracia, Fira, ARTS and AYRE Hotels]


400% CAPACITY INCREASE, 40% TCO SAVINGS, 35% LESS POWER 
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RADIO BASEBAND PROCESSING EVOLUTION 
MORE EFFICIENCY, LESS SPACE, LESS ENERGY 


2010 Design: Discrete L1, L2, L3 Processors
* FPGAs: L1 (PHY, Turbo Decoders...)
* DSPs: L1 (Channel Estimation, ...) and L2 (RLC/MAC, Scheduler, ...)
* CPUs: L2, L3 (Transport, Security...)

2012 Design: System On a Chip (SoC), Integrated L1, L2, L3
* HW Accelerators: L1
* Multi-Core DSPs: L1, L2
* Multi-Core CPUs: L2, L3

2017 Design: GPP-Centric but HW-Accelerated, General Purpose Processors (GPPs)
* GPP: L3, L2 & some L1 Processing
* Integrated or discrete HW Accelerators: L1


SYSTEMATICALLY PUSHING TECHNOLOGY LIMITS 
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THE FUTURE OF WIRELESS 
WHAT ABOUT WIFI? 


Wi-Fi Control Module
* Automatic network discovery & selection (ANDSF) function, enables users to be automatically connected to the “best” network

WLAN Gateway
* Secure Wi-Fi gateway functionality for encrypted access

Integrated Wifi+LTE Metrocells
* Provide ‘Anyspectrum’ wireless access solution, with ‘Anybackhaul’ option
* Wi-Fi: ~100Mbps (Best Effort, Variable Experience)
* Cellular small cell: ~100Mbps (QoS-enabled, Constant Experience)

[Network diagram: Wi-Fi Control Module, WLAN Gateway, Packet Core, PGW/GGSN]


SEAMLESS ROAMING BETWEEN CELLULAR AND WIFI NETWORKS BASED ON BEST 
NETWORK FOR APP 




THE FUTURE OF WIRELESS 
SMALL CELLS CLOSE THE “DEMAND GAP” 


INCREMENTAL CAPACITY OPTIONS 


Available 2013 


[Chart legend: LTE, LTE-A, AAA, Wi-Fi, LTE SmallCells, lightRadio™ Network]


THE FUTURE IS TECHNICALLY REALIZABLE 
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IN SUMMARY


* WHAT IS DRIVING THE MARKET?: The Tablet Generation
* WHERE IS THE REAL VALUE?: Device + Cloud Network
* THE NEW REALITY: The Network Platform (using SoCs, NPs, GPPs)


THE NEXT DIGITAL ECONOMY ENABLED
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Agenda


■ Stratix V FPGA architecture for Floating Point
■ New Approach: “Fused Data Path”
■ Throughput, GFLOPs, GFLOPs/W
  - FFT
  - Cholesky Decomposition
  - QR Decomposition
■ Computational Accuracy
■ Third Party Benchmarking
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Altera’s Variable-Precision DSP Block 




Set the Precision Dial to Match Your Application 
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Why Floating Point at 28nm?


■ Floating point density determined by hard multiplier density
■ Multipliers must efficiently support floating point mantissa sizes

[Chart: Multipliers vs Stratix III / IV / V. Series: 18x18 Mults, SP FP Mults, DP FP Mults. Y-axis: 1000 to 4500. Device shown: EP3SE110 (65nm)]
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Floating Point Multiplier Capabilities 


■ Floating point density determined by hard multiplier density
■ Multipliers must efficiently support floating point mantissa sizes

[Chart: Multipliers vs Stratix III / IV / V. Series: 18x18 Mults, SP FP Mults, DP FP Mults. Y-axis: 500 to 4500. Devices: EP3SE110 (65nm), EP4SGX230 (40nm), 5SGSB8 (28nm)]
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Allows High Performance Floating-Point 
in FPGAs
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New Floating-Point Implementation 


Processor: 


Each Operation IEEE754 




Denormalize 


Normalize 
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l 


———— 


Altera Floating Point: Fused Datapath

[Diagram: inputs Mantissa1, Mantissa2, Exponent1, Exponent2; outputs Mantissa, Exponent]

■ Slightly larger, wider operands
■ True floating mantissa (not just 1.0 to 1.99..)
■ Remove normalization
■ High performance, low resources


Vector Dot Product Example

[Diagram: DeNormalize at the inputs, Normalize once at the output]
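
The effect of normalizing once at the end of a dot product, instead of after every operation, can be illustrated with a small software analogy. This is only a sketch under stated assumptions, not Altera's hardware implementation: the helper `f32` and both function names are hypothetical, per-operation IEEE 754 rounding is modelled by rounding to single precision after each multiply and add, and the "fused" path is modelled by carrying a wider (double precision) accumulator and rounding once.

```python
import struct


def f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot_round_each_op(a, b):
    """Model of a processor-style datapath: every multiply and add is
    rounded (normalized) to single precision before the next operation."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(f32(x) * f32(y)))
    return acc


def dot_fused(a, b):
    """Software stand-in for a fused datapath: intermediate results keep
    extra precision and only the final result is rounded."""
    acc = 0.0  # wider accumulator (double precision in this sketch)
    for x, y in zip(a, b):
        acc += f32(x) * f32(y)
    return f32(acc)


a = [1.0, 2.0 ** -24, 2.0 ** -24]
b = [1.0, 1.0, 1.0]
print(dot_round_each_op(a, b))  # small terms are lost to intermediate rounding
print(dot_fused(a, b))          # small terms survive until the final rounding
```

With these inputs the per-operation version drops both small terms, while the single final rounding keeps their sum.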
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Floating Point Functions 


■ Math.h functions implemented using “Fused Datapath”
  - SIN, COS, TAN
  - ASIN, ACOS, ATAN
  - EXP, LOG, LOG10
  - LDEXP, FLOOR, CEIL
  - SQRT, 1/SQRT
  - DIVIDE, MOD
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E 

[Screenshot: DSP Builder block library window showing the blocks ACos, ASin, ATan, Acc, Cos, Sin, Tan, Floor, Sqrt, RecipSqRt, Reciprocal, Divide, Mod, Exp, LdExp, Log]


ALTERA, ARRIA, CYCLONE, HARDCOPY, MAX, MEGACORE, NIOS, QUARTUS & STRATIX are Reg. U.S. Pat. & Tm. Off.
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Fast Fourier Transform (FFT) 
Matrix Inversion algorithms 

- Cholesky Decomposition 
- QR Decomposition 
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Altera 28nm high end FPGAs 


Stratix V “GS” Family

Part Number   LEs    ALUTs / Registers   DSP Multiplier Count   Mbits / M20K memory blocks   14 Gbps Transceiver Count
5SGSD3        236K   178K / 356K         1200                   13 / 688                     24
5SGSD4        360K   272K / 543K         2088                   19 / 957                     36
5SGSD5        457K   345K / 690K         3180                   39 / 2014                    36
5SGSD6        583K   440K / 880K         3550                   45 / 2320                    48
5SGSD8        695K   525K / 1050K        3926                   50 / 2567                    48


@ 2011 Altera Corporation—Confidential 



Fast Fourier Transform (FFT) Performance 
(Mid-size Stratix V, full Floating Point) 


FFT MegaCore: 14 single-precision floating-point FFT cores, 1,024 pt

Device: 5SGSD5 Usage

28 nm Stratix V FPGA: ~1W per Floating-Point FFT Core
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FPGA versus DSP Processor


Device                                 Altera Stratix V 5SGSD8                  Texas Instruments TMS320C6678
Resources                              695 kLEs, 50 Mbits block mem,            8 cores, fixed and SP floating point,
                                       3926 multipliers, 48 TRX (14 GSPS)       1.25 GHz
Peak GMACs (16x16 or 18x18)            2350 (3926 multipliers @ 600 MHz)        320 (40 GMACs per core)
Peak GFLOPs rating (single precision)  1000 (see 1 TeraFlop whitepaper)         160 (20 GFLOPs per core)
1024 length floating point FFT         3.41 us (1024 clock cycles @ 300 MHz)    10.26 us (12800 clock cycles @ 1.25 GHz)
performance (single precision)
Aggregate 1024 length FFT              0.17 us (20 FFTs per device)             1.28 us (8 FFTs per device, 1 per core)
transform time
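
The timing figures in this comparison follow from cycle counts and clock rates, and the arithmetic can be checked with a short script. This is a sketch only: the function name `fft_time_us` is hypothetical, and the inputs are the cycle counts, clocks, and FFT counts quoted on the slide.

```python
def fft_time_us(cycles, clock_hz):
    """Time for one transform in microseconds, given cycle count and clock."""
    return cycles / clock_hz * 1e6


# Figures quoted on the slide
fpga_us = fft_time_us(1024, 300e6)      # Stratix V: 1024 cycles @ 300 MHz
dsp_us = fft_time_us(12800, 1.25e9)     # TMS320C6678: 12800 cycles @ 1.25 GHz

# Aggregate time per transform when several FFTs run in parallel
fpga_aggregate_us = fpga_us / 20        # 20 FFTs per device
dsp_aggregate_us = dsp_us / 8           # 8 FFTs per device, 1 per core

# Peak multiply-accumulate rate: multipliers x clock
fpga_peak_gmacs = 3926 * 600e6 / 1e9

print(round(fpga_us, 2), round(dsp_us, 2))
print(round(fpga_aggregate_us, 2), round(dsp_aggregate_us, 2))
print(round(fpga_peak_gmacs))
```

Computed this way, 12800 cycles at 1.25 GHz comes to 10.24 us, slightly below the 10.26 us quoted, and 3926 multipliers at 600 MHz give about 2356 GMACs, which the slide rounds to 2350.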


The Cholesky Decomposition 


■ The Least Squares solution for x in Ax = b

■ A must be Hermitian (conjugate symmetric)
  - Only lower triangular matrix is needed for calculation

■ If A is positive definite, it can be decomposed into lower triangular matrix L and conjugate transpose L’ (A = L * L’)

■ With Cholesky decomposition, x is solved via forward and backward substitution with decomposed matrices L and L’

■ Cholesky decomposition method is more efficient than LU decomposition methods, which are suitable for any matrix.
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Conjugate Symmetric

$$A_{jj} = \sum_{k=1}^{j} L_{jk}\,\mathrm{conj}(L_{jk})$$

where j is the column index of the matrix

$$L_{jj} = \sqrt{A_{jj} - \sum_{k=1}^{j-1} L_{jk}\,\mathrm{conj}(L_{jk})}$$

$$L_{11} = \sqrt{A_{11}}$$
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Off-diagonal Elements 


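$$
\begin{bmatrix}
L_{11} & 0 & 0 & 0 \\
L_{21} & L_{22} & 0 & 0 \\
L_{31} & L_{32} & L_{33} & 0 \\
L_{41} & L_{42} & L_{43} & L_{44}
\end{bmatrix}
\begin{bmatrix}
L'_{11} & L'_{12} & L'_{13} & L'_{14} \\
0 & L'_{22} & L'_{23} & L'_{24} \\
0 & 0 & L'_{33} & L'_{34} \\
0 & 0 & 0 & L'_{44}
\end{bmatrix}
= A \quad \text{(conjugate symmetric)}
$$

$$A_{ij} = \sum_{k=1}^{j} L_{ik}\,L'_{kj}$$

where i and j are the row and column indices of the matrix

$$A_{ij} = \sum_{k=1}^{j} L_{ik}\,\mathrm{conj}(L_{jk})$$

where $L'_{kj}$ is the conjugate transpose of $L_{jk}$

$$L_{ij} = \frac{A_{ij} - \sum_{k=1}^{j-1} L_{ik}\,\mathrm{conj}(L_{jk})}{L_{jj}}$$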
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Equation 2 


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Forward Substitution 


We now have L and L’, thus A*x = b > L*L’*x = b
If we define: y = L’*x > L*y = b

L is the lower triangular matrix, y and b are column matrices and b is known in the system, so y can be solved by forward substitution:

$$y_j = \frac{b_j - \sum_{k=1}^{j-1} y_k\,L_{jk}}{L_{jj}} \qquad \text{(Equation 3)}$$

Note that solving for y is very similar to solving for L, shown below:

$$L_{ij} = \frac{A_{ij} - \sum_{k=1}^{j-1} L_{ik}\,\mathrm{conj}(L_{jk})}{L_{jj}} \qquad \text{(Equation 2)}$$

Since the equations are similar, Cholesky decomposition and forward substitution are combined into the same process. The only difference is that Eq 2 is conjugated.
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Backward Substitution 


x can be solved by backward substitution, L’*x = y

Since L’ is an upper triangular matrix, x has to be solved from the bottom to the top, which is why it is called back substitution:

$$x_j = \frac{y_j - \sum_{k=j+1}^{n} x_k\,L'_{jk}}{L'_{jj}} \qquad \text{(Equation 4)}$$
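
The three stages described on these slides (decomposition, forward substitution, backward substitution) can be sketched in plain Python for a small complex matrix. This is an illustrative reference model only, not the FPGA core: all function names are hypothetical, the matrix is assumed to be Hermitian positive definite, and the upper triangular factor L’ is accessed as the conjugate of L rather than stored separately.

```python
import math


def cholesky(A):
    """Lower triangular L with A = L * L' for a Hermitian positive definite A."""
    n = len(A)
    L = [[0j] * n for _ in range(n)]
    for j in range(n):
        # Diagonal element: sqrt(A_jj - sum |L_jk|^2)
        s = complex(A[j][j]).real - sum(abs(L[j][k]) ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = complex(math.sqrt(s))
        # Off-diagonal elements of column j (Equation 2)
        for i in range(j + 1, n):
            acc = A[i][j] - sum(L[i][k] * L[j][k].conjugate() for k in range(j))
            L[i][j] = acc / L[j][j]
    return L


def forward_substitution(L, b):
    """Solve L * y = b from the top row down (Equation 3)."""
    y = [0j] * len(b)
    for j in range(len(b)):
        y[j] = (b[j] - sum(y[k] * L[j][k] for k in range(j))) / L[j][j]
    return y


def backward_substitution(L, y):
    """Solve L' * x = y from the bottom row up (Equation 4), L'_jk = conj(L_kj)."""
    n = len(y)
    x = [0j] * n
    for j in reversed(range(n)):
        acc = y[j] - sum(x[k] * L[k][j].conjugate() for k in range(j + 1, n))
        x[j] = acc / L[j][j].conjugate()
    return x


def cholesky_solve(A, b):
    L = cholesky(A)
    return backward_substitution(L, forward_substitution(L, b))


# Small Hermitian positive definite example (values chosen for illustration)
A = [[4, 2j], [-2j, 3]]
b = [1, 2]
x = cholesky_solve(A, b)
residual = [sum(A[i][k] * x[k] for k in range(2)) - b[i] for i in range(2)]
print(x, residual)
```

The residual A*x - b should be zero to within rounding error.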
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Cholesky Block Diagram 


Solve for x in Ax = b where A is conjugate symmetric
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Performance and FPGA Resources


Cholesky Decomposition Parameterizable Core using 5SGSD5

Complex Input   Vector   ALUTs          M20K Memory    27x27 DSPs    Latency @ Operating   GFLOPS per core (complex
Matrix Size     Size     (% used)       blocks (%)     (% used)      frequency             single precision)
30x30           30       76.5K (22%)    793 (39%)      146 (9%)      255 us @ 250 MHz      21.7
60x60           60       141K (41%)     955 (47%)      268 (17%)     328 us @ 235 MHz      39.0
240x240         60       154K (45%)     1820 (90%)     268 (17%)     922 us @ 220 MHz      74.2
360x360         90       204K (59%)     1411 (70%)     391 (25%)     1103 us @ 190 MHz     91.8
400x400         100      220K (64%)     1619 (80%)     430 (21%)     1342 us @ 190 MHz     103


GFLOPs and GFLOPs/Watt


Cholesky Decomposition Parameterizable Core using 5SGSD5

Complex Input Matrix   Vector   Throughput (matrices   GFLOPS per core (complex   Core power consumption as measured   GFLOPs/Watt
Size (n x n)           Size     per second)            single precision)          using Altera 5SGSD5 eval board
30x30                  30       472,464                21.7                       7.7 W                                2.8
60x60                  60       118,858                39.0                       13.6 W                               2.9
240x240                60       8,467                  74.2                       14.0 W                               5.3
360x360                90       1142                   91.8                       14.7 W                               6.2
400x400                100      1182                   103                        16.1 W                               6.4


Complex Cholesky FLOPs = 4/3 n^3 + 8 n^2
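
The efficiency column is simply the GFLOPS figure divided by the measured core power, and the operation count comes from the formula on the slide. A minimal sketch, assuming the exponents in the formula are n cubed and n squared (the extracted text is ambiguous) and using hypothetical function names:

```python
def complex_cholesky_flops(n):
    """Operation count for one n x n complex Cholesky solve, per the slide."""
    return 4.0 / 3.0 * n ** 3 + 8.0 * n ** 2


def gflops_per_watt(gflops, watts):
    return gflops / watts


# (matrix dimension, GFLOPS per core, core power in W) as quoted in the table
rows = [(30, 21.7, 7.7), (60, 39.0, 13.6), (240, 74.2, 14.0),
        (360, 91.8, 14.7), (400, 103.0, 16.1)]

efficiency = [round(gflops_per_watt(g, w), 1) for _, g, w in rows]
print(efficiency)

# Cross-check one row: FLOPs per matrix x matrices per second
print(complex_cholesky_flops(400) * 1182 / 1e9)  # close to the quoted 103
```

The rounded ratios reproduce the GFLOPs/Watt column of the table.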
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Competitive Results: Nvidia GPU


Matrix Size   GFLOPs with LAPACK Library   GFLOPs with Magma Library   GFLOPs with Nvidia OpenCL Library
512x512       20                           22                          58
768x768       20                           39                          82
1024x1024     36                           57                          68
2048x2048     60                           117                         96

Cholesky FLOPs = 4N^3/3, where N is matrix dimension

■ Results in about 0.25 GFLOPs/Watt (512x512)
■ Nvidia GTX480 rated at 977 GFLOPs
■ Intel Pentium4 3.7 GHz rated at 14.8 GFLOPS

Source: High Performance Relevance Vector Machine on GPUs, Depeng Yang, Getao Liang, David Jenkins, Gregory D., U of Tennessee, Knoxville
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More Nvidia Results 


Matrix Size   CPU GFLOPs   GPU GFLOPs   GPU speedup
1024x1024     24.2         51.4         3.1
2048x2048     26.5         111.7        32
3072x3072     27.5         151.6        6.5
4032x4032     29.96        183.02       7.1

Using Magma 1.0 RC5 library

■ Nvidia Fermi Tesla C2050, 1147.0 MHz clock
■ AMD Quadro NVS 290, 918.0 MHz clock
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MAGMA LAPACK for GPUs 

Stan Tomov, Research Director, Innovative 
Computing Laboratory 

Department of Computer Science 
University of Tennessee, Knoxville 




QR Decomposition


■ QR Solver finds solution for Ax=b linear equation system using QR decomposition, where Q is ortho-normal and R is upper-triangular matrix. A can be rectangular.

■ Steps of Solver
  - Decomposition: A = Q.R
  - Ortho-normal property: Q^T.Q = I
  - Substitute then mult by Q^T: Q.R.x = b > R.x = Q^T.b = y
  - Backward Substitution: Q^T.b = y, solve R.x = y

■ Decomposition is done using Gram-Schmidt derived algorithms. Most of computational effort is in “dot-product”
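
The solver steps listed above can be sketched with a modified Gram-Schmidt decomposition in plain Python. This is a reference sketch under assumptions, not the FPGA core: the slide only says "Gram-Schmidt derived", so the choice of the modified variant is an assumption, the function names are hypothetical, A is assumed to have full column rank with at least as many rows as columns, and for complex data the transpose Q^T is taken as the conjugate transpose.

```python
import math


def dot(u, v):
    """Inner product sum(conj(u_i) * v_i); this is where most of the work goes."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def qr_mgs(A):
    """Modified Gram-Schmidt. Returns Q as a list of columns and upper triangular R."""
    m, n = len(A), len(A[0])
    columns = [[complex(A[i][j]) for i in range(m)] for j in range(n)]
    Q = []
    R = [[0j] * n for _ in range(n)]
    for j in range(n):
        v = columns[j][:]
        for k in range(j):
            R[k][j] = dot(Q[k], v)                      # projection on earlier column
            v = [vi - R[k][j] * qi for vi, qi in zip(v, Q[k])]
        norm = math.sqrt(dot(v, v).real)
        R[j][j] = complex(norm)
        Q.append([vi / norm for vi in v])
    return Q, R


def qr_solve(A, b):
    """Least squares solution of A x = b: y = Q^H b, then back-substitute R x = y."""
    Q, R = qr_mgs(A)
    y = [dot(q, b) for q in Q]
    n = len(R)
    x = [0j] * n
    for j in reversed(range(n)):
        x[j] = (y[j] - sum(R[j][k] * x[k] for k in range(j + 1, n))) / R[j][j]
    return x


# Rectangular (3 x 2) example chosen for illustration
A = [[1, 1], [1, 2], [1, 3]]
b = [1, 2, 2]
x = qr_solve(A, b)
print(x)
```

For this overdetermined example the result is the least squares fit, x = (2/3, 1/2).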
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Block Diagram 


Solve for x in Ax = b where A is non-symmetric, may be rectangular
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Performance and FPGA Resources


QR Decomposition Parameterizable Core using 5SGSD5

Complex Input   Vector   ALUTs          M20K Memory    27x27 DSPs    Latency @ Operating   GFLOPS per core (complex
Matrix Size     Size     (% used)       blocks (%)     (% used)      frequency             single precision)
50x100          50       105K (30%)     230 (11%)      227 (14%)     45 us @ 250 MHz       43.8
100x200         50       106K (31%)     304 (15%)      228 (14%)     213 us @ 250 MHz      64.3
100x200         100      202K (58%)     504 (25%)      428 (27%)     173 us @ 200 MHz      91.9
250x400         100      200K (58%)     858 (43%)      428 (27%)     1586 us @ 200 MHz     106
400x400         100      203K (59%)     1566 (78%)     428 (21%)     4029 us @ 200 MHz     106


GFLOPs and GFLOPs/Watt


QR Decomposition Parameterizable Core using 5SGSD5

Complex Input Matrix   Vector   Throughput (matrices   GFLOPS per core (complex   Core power consumption as measured   GFLOPs/Watt
Size (n x m)           Size     per second)            single precision)          using Altera 5SGSD5 eval board
50x100                 50       31,681                 43.8                       10.8 W                               4.1
100x200                50       5,920                  64.3                       13.9 W                               4.6
100x200                100      8,467                  91.9                       21.0 W                               4.4
400x400                100      310                    106                        25.2 W                               4.2
450x450                75       165                    80.0                       20.2 W                               4.0


Complex QRD FLOPs = 5.33mn^2 + 8mn - 2n + 4n^2
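
The GFLOPS column can be cross-checked by multiplying the operation count per matrix by the measured throughput. A minimal sketch, assuming the garbled exponents in the formula are squares and that a size written "50x100" means n = 50 and m = 100, as the "(n x m)" column header suggests; the function names are hypothetical.

```python
def complex_qrd_flops(n, m):
    """Operation count for one complex QR decomposition, per the slide formula."""
    return 5.33 * m * n ** 2 + 8 * m * n - 2 * n + 4 * n ** 2


def sustained_gflops(n, m, matrices_per_second):
    return complex_qrd_flops(n, m) * matrices_per_second / 1e9


print(round(sustained_gflops(50, 100, 31681), 1))   # first table row
print(round(sustained_gflops(400, 400, 310)))       # 400x400 row
```

Both rows land on the quoted figures, which supports this reading of the formula.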
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Computational error analysis 


QR Decomposition Accuracy

Complex Input Matrix   Vector   MATLAB using computer   DSPBA generated RTL
Size (n x m)           Size     Norm/Max                Norm/Max
50x100                 50       5.01e-5 / 6.42e-6       4.87e-5 / 6.02e-6
100x200                100      2.3e-5 / 1.24e-6        1.68e-5 / 9.97e-7
400x400                100      8.8e-5 / 4.81e-6        7.07e-5 / 4.03e-6

Norm computed using the Frobenius norm $\|E\|_F$

Cholesky Decomposition results are similar


Summary


■ High performance floating point designs can be built using FPGAs
  - High density of 27x27, 36x36, 54x54, 72x72 multipliers available at 28nm
  - New floating point toolflow reduces routing density to sustainable level
  - Availability of optimized math.h library of floating point functions

■ FPGA fixed point parallelism performance benefits now carry over into floating point

■ Best in class GFLOPs / Watt

■ Real-world, not marketing, floating point benchmarks for comparison
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An IA-32 Processor with a Wide Voltage Operating
Range in 32nm CMOS


Gregory Ruhl, Saurabh Dighe, Shailendra Jain, Surhud Khare,
Satish Yada, Ambili V, Praveen Salihundam, Shiva Ramani,
Sriram Muthukumar, Srinivasan M, Arun Kumar, Shasi Kumar,
Rajaraman Ramanarayanan, Vasantha Erraguntla, Jason
Howard, Sriram Vangal, Paolo Aseron, Howard Wilson, Nitin
Borkar

Microprocessor & Programming Research, Intel Labs 
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Purpose 
Claremont: Near Threshold Voltage and Wide Dynamic Range IA Core 


* Demonstrate energy benefits of Near Threshold Voltage (NTV) computing to IA
* Extend dynamic range of operation from NTV to Vmax for energy efficient performance
* Advance low voltage, variation aware and multi-corner design methodologies


Agenda 


* Design Challenges
* Claremont Prototype
* Design Strategies and Methodologies
  * NTV Design
  * Wide Dynamic Range Design
* Results and Summary


Design Challenges 


* Reduced Ion/Ioff, noise margins and variability result in circuit functional failures
* Power/performance profile becomes extremely sensitive to PVT variations
* Tools and methodologies are not mature for low voltage designs
* Wide dynamic range design convergence is complicated by
  o Disproportionate device vs. interconnect delay scaling
  o Multiple voltage domains


Agenda 


* Design Challenges
* Claremont Prototype
* Design Strategies and Methodologies
  * NTV Design
  * Wide Dynamic Range Design
* Results and Summary


P54C IA-32 Core Background


Legacy Pentium ® core (1994) 
32-bit CPU with 64-bit data bus 


Superscalar, in-order pipeline 
architecture with pipelined floating point 
unit 


Dynamic branch prediction 
Separate code and data caches (8KB) 


Fractional bus operation allowing core 
frequencies higher than 66MHz FSB 


Setting the Design Targets
Energy Estimation for P54C IA core

[Chart: Normalized Energy/Cycle vs. Voltage/Frequency operating points (0 to 1.2), with Sub-Threshold, Near-Threshold and Super-Threshold regions]

~5X efficiency improvement with aggressive voltage scaling


Claremont Prototype 


P54C IA Core in 32nm CMOS

* Aggressive Voltage Scaling
  o Vmin Target: 0.5V (Logic) and 0.55V (RF)
  o Low Voltage, Variation Aware Design
* Ultra Low Power Design
  o Sub-20mW Total Core Power Target at 0.5V
* Wide Dynamic Operating Range
  o Triple Corner Optimization
  o 0.5V/66MHz, 0.75V/333MHz, 1.05V/525MHz

[Block diagram of the P54C core, including ALUs and level shifters]


Proactive Power Management 


* Instruction-driven power gating
* Dynamically turn off FPU during idle periods
* 65% FPU sleep enabled
* Single cycle wake-up
* Programmable sleep-threshold
  o Based on application and operating point
* Energy saved > Wake-up overheads

[Diagram: Next Inst. ID and Current Inst. ID feed a counter that drives FPU Stall/Reset and the FPU Clock of the Floating Point Unit (FPU)]

Pipeline stages: PF: Prefetch; D1, D2: Decode; E1, E2: Execute; WB: Write-back; C: Cache Access; ER: FP Error

Workload-aware, fine-grain power management
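
The idle-counter policy described on this slide can be illustrated with a toy cycle-level model. This is a sketch under assumptions, not the Claremont control logic: the function name, the one-instruction-per-cycle stream, and the accounting of sleep cycles are all hypothetical; only the ideas of a programmable sleep threshold and a single cycle wake-up come from the slide.

```python
def simulate_fpu_gating(uses_fpu_stream, sleep_threshold, wake_penalty=1):
    """Toy model of instruction-driven FPU power gating.

    uses_fpu_stream: iterable of booleans, one per cycle, True when the
                     instruction needs the FPU.
    sleep_threshold: idle cycles to wait before gating the FPU (programmable).
    Returns (sleep_cycles, wake_ups, stall_cycles).
    """
    idle = 0
    asleep = False
    sleep_cycles = 0
    wake_ups = 0
    for uses_fpu in uses_fpu_stream:
        if uses_fpu:
            if asleep:
                asleep = False
                wake_ups += 1        # pay the wake-up penalty once
            idle = 0
        else:
            idle += 1
            if idle >= sleep_threshold:
                asleep = True        # counter expired: gate the FPU
            if asleep:
                sleep_cycles += 1
    return sleep_cycles, wake_ups, wake_ups * wake_penalty


# One FP instruction, ten integer instructions, one FP, three integer
stream = [True] + [False] * 10 + [True] + [False] * 3
print(simulate_fpu_gating(stream, sleep_threshold=4))
```

A lower threshold gates the FPU sooner but risks more wake-ups, which is the trade-off behind "Energy saved > Wake-up overheads".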
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Agenda 


* Design Challenges
* Claremont Prototype
* Design Strategies and Methodologies
  * NTV Design
  * Wide Dynamic Range Design
* Results and Summary


Near Threshold Voltage Logic 


* Variation aware library pruning to ensure reliable NTV operation
* Limited transistor stacks to 3, no wide TG muxes, no contention circuits
* Pruned minimum sized and low drive strength cells
* Sequentials redesign with interruptible and upsized keepers
* 10T single ended transmission gate register file cell topology
* Semi-interruptible split output level shifters
* Full swing 3.3V I/O for legacy board compatibility
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Low Voltage Timing Convergence 
Random/Systematic process variations = Path delay uncertainty = Max/min failures 


* Max convergence strategy
  o Setup violations are not fatal; can be corrected by relaxing PV phase
  o Conservative library pruning > Low variation impact > Better PV to silicon correlation
  o Shallow P54C pipeline (~75 gate stages) > Averaging effect of random variations
* Min convergence strategy
  o Hold violations are fatal; important to ensure sufficient hold margins
  o Tango charges data/clock path variation; need to consider sequential hold variation as well


“Variation aware" timing convergence is essential 


Hold Margin Guard Banding 


Hold time variation characterization using NOVA/MPP2 


[Table: hold time by sequential variant, 20°C. Recoverable row: 2X Local Clock Inverters, 136 ps and 786 ps]


Variation aware hold margin guard banding for robust sub-0.5V operation 


Agenda 


* Design Challenges
* Claremont Prototype
* Design Strategies and Methodologies
  * NTV Design
  * Wide Dynamic Range Design
* Results and Summary


Wide Dynamic Range Design Challenge 


P1269.4
[Charts: Critical Path at 0.5V, 98% / 2%; Critical Path at 1.05V, 75% / 25%]

The change in critical path distribution across the wide voltage range significantly increases timing optimization efforts to achieve targeted frequency at a given voltage corner


Wide Dynamic Range Design Optimizations 


[Table: Synthesis Corner Evaluation, timing metrics WNS, TNS and Fmax at 0.75V and 0.5V]

* Insignificant ICD at ULV > Suboptimal P&R > ICD dominated critical paths at high voltages
  o Prioritized ICD dominated paths at partition level before placement
* Synthesis constraints: superset of constraints required across voltage range
  o Multi corner constraints for single corner synthesis


Multi-Voltage Clocking and Skew Management

[Diagram: RLS voltage domain and EBB voltage domain linked through programmable delay buffers]

* Core logic (RLS) & memories (EBB) operate on independent power supplies
* Level shifters in clock distribution network
* Inter-block skew is voltage dependent
* Programmable delay buffers for skew management, configured via scan
* 1.5-2X skew reduction across different RLS, EBB voltage combinations


Enhanced PV Methodology

* Modified intra-block & inter-block skew function in Tango for multi-voltage timing analysis and roll-ups
* Incorporated block Voltage Mapping (VM) table and Voltage Dependent Skew (VDS) table in Tango environment
* Skew is computed based on operating voltage of launching and sampling block using entries in VM & VDS tables
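
The two-table lookup described above can be pictured as follows. This is purely an illustration of the lookup structure: the block names, voltages, and skew values are hypothetical placeholders, not data from Tango or from the Claremont design, and the function name is invented for this sketch.

```python
# Hypothetical Voltage Mapping (VM) table: block name -> operating voltage (V)
VOLTAGE_MAP = {
    "core_logic": 0.50,
    "register_file": 0.55,
}

# Hypothetical Voltage Dependent Skew (VDS) table:
# (launching block voltage, sampling block voltage) -> clock skew (ps)
VDS_PS = {
    (0.50, 0.50): 20.0,
    (0.50, 0.55): 35.0,
    (0.55, 0.50): 35.0,
    (0.55, 0.55): 20.0,
}


def inter_block_skew_ps(launch_block, sample_block, vm=VOLTAGE_MAP, vds=VDS_PS):
    """Look up skew from the voltages of the launching and sampling blocks."""
    key = (vm[launch_block], vm[sample_block])
    return vds[key]


print(inter_block_skew_ps("core_logic", "register_file"))
```

Keeping voltages and skews in separate tables means a new supply combination only needs new table entries, not a new timing function.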


Agenda 


* Design Challenges
* Claremont Prototype
* Design Strategies and Methodologies
  * NTV Design
  * Wide Dynamic Range Design
* Results and Summary


Claremont Die Micrograph and Test Setup

[Die micrograph: CLAREMONT]

Technology: 32nm High-K Metal Gate Technology, 1 Poly, 9 Metal (Cu)


Claremont Test Challenges
* 18 year old motherboard
  o Age variation
* Most peripherals fail below 15MHz
* Front side bus spec timing and voltage challenges
* Lack of uBreak points & advanced debug features
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Total Core Power (mW) 


Power/Performance Characteristics

[Chart: Total Core Power (mW, 0 to 700) and Core Fmax (MHz) vs. Logic Vcc / Memory Vcc (V), Typical Skew, 20°C. Logic Vcc from 0.38V to 1.1V; Memory Vcc held at 0.65V up to 0.6V logic, then tracking Logic Vcc]

Wide Dynamic Range 1.1V/741MHz/445mW to 380mV/10MHz/1.5mW

Total Power Breakdown

[Stacked chart, 0% to 100%: Pdyn Logic, Pdyn Memory, Plkg Logic, Plkg Memory vs. Logic Vcc / Memory Vcc (V), Typical Skew, 20°C]

Leakage power scales from 3% @ 1.1V to 50% @ 0.38V


Energy/Cycle: Typical Skew

[Chart: Energy/Cycle (nJ) vs. Logic Vcc / Memory Vcc (V), measured using BIST, Typical Skew, 20°C. Curves: Total Energy, Dynamic Energy, Leakage Energy. Minimum marked at 0.45V]

4.5X Energy reduction from Vmax: 135pJ/cycle at 450mV
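
The energy-per-cycle figures follow from dividing total power by clock frequency, which can be checked against the end points quoted for the wide dynamic range measurement. A small sketch with a hypothetical function name, using only numbers from the slides:

```python
def energy_per_cycle_pj(power_mw, freq_mhz):
    """Energy per clock cycle in pJ: mW / MHz gives nJ, times 1000 for pJ."""
    return power_mw / freq_mhz * 1000.0


e_vmax = energy_per_cycle_pj(445.0, 741.0)   # 1.1V / 741MHz / 445mW
e_vmin = energy_per_cycle_pj(1.5, 10.0)      # 380mV / 10MHz / 1.5mW
reduction = e_vmax / 135.0                   # relative to 135pJ/cycle at 450mV

print(round(e_vmax), round(e_vmin), round(reduction, 2))
```

The result is about 600pJ/cycle at Vmax and 150pJ/cycle at 380mV, so the minimum near 450mV is roughly 4.4 to 4.5 times lower than at Vmax, in line with the quoted 4.5X.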


Temperature Dependence: Typical Skew

[Chart: Energy/Cycle (nJ) vs. Logic Vcc from 0.35V to 0.6V, Memory Vcc at 650mV]

25% increase from 5C to 60C


Energy/Cycle: Fast Skew

[Chart: Energy/Cycle (nJ) vs. Logic Vcc / Memory Vcc (V), measured using BIST, Fast Skew, 5°C. Curves: Total Energy, Dynamic Energy, Leakage Energy. Logic Vcc from 0.34V to 1.05V]

3X Energy reduction from Vmax: 200pJ/cycle at 500mV


Area Penalty per Technique 


Technique                  Increase from Audit DB
Modified Sequentials       27%
0.5V 66MHz Target          24%
Complex cells Pruning      10%
3 RLS blocks vs. 1         8%
Min Z                      5%
Additional Min fixing      2%

* Area overhead is a non-linear function of Vccmin improvement.
* Incremental Vccmin improvement is more practical and will have a lower penalty


Claremont: Industry's First NTV IA Core

* 380mV (0.38V) Logic Vcc
* 293X Total Power Reduction from Vmax to Vmin
* Total Core Power at Vmin
* Total Energy Savings from Vmax to Vopt
* A0 silicon booting multiple O/S
* Measured using BIST
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Conclusion 


* NTV: A Promising Technology for Energy Efficient Computing
  o Beneficial for ultra-low power IA with modest performance demands
  o Energy efficient SOCs, Graphics, Sensor hubs, Many-core CPUs, Exascale...
* Claremont Demonstrated Reliable NTV Operation, Enabled by
  o Novel circuit design techniques for logic, sequentials and memories
  o Variation aware design convergence strategies
* Next Steps:
  o Low overhead NTV circuits, ULV standard cell libraries
  o CAD Methodologies for Low Vcc & WDR designs: SSTA, Multipoint optimization
  o Device, Circuits, Architecture co-design for Near Threshold Computing


Intel Confidential 


Near Threshold Voltage Logic 


* Variation aware library pruning to ensure reliable NTV operation
* Limited transistor stacks to 3, no wide TG muxes, no contention circuits
* Pruned minimum sized and low drive strength cells: minimum Z allowed is 2X process Zmin
* Only 40% of combinational standard cells in the library used in the design

Re-characterized constrained standard cell library at 0.5V NTV corner
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Near Threshold Voltage Sequentials 


* Modified C²MOS topology with interruptible and upsized keepers
* Slave TG in conventional flip-flop replaced by clocked inverter, eliminating risk of write-back failures due to charge sharing
* Keepers upsized by 2X to achieve retention Vmin of 0.5V (RSSS, 5.5σ, -25°C, 1.1MΩ Rg)
* Designed 13 custom sequential flavors and re-characterized at 0.5V

Robust, write-back free sequentials with 70mV retention Vmin improvement
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Low Voltage RFs 


* 10-T single ended transmission gate (SETG) latch topology
* Fully interruptible bit cell for contention free writes
* Retention limited cell, upsized (from 3-Track to 5-Track) to achieve retention Vmin of 550mV (RSSS, 5.9σ, -25°C, 1.1MΩ Rg)
* Programmable keepers (3 vs. 4 stack) during read
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Near Threshold Voltage Level Shifters 


* Semi-interruptible split output level shifter topology
* Interruptible PMOS reduces contention, split output decouples delay path from latch stage
* Asymmetric critical path based sizing
* Two-stage level shifters between RLS & EBB with intermediate reference voltage
* Wide range level translation: sub-0.5V to 1.1V

[Circuit diagram labels: VCC High, OUT]

60% performance improvement over single stage, symmetric level shifters
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Custom Interposer 


and Evaluation Board
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Legacy Pentium" Socket-7 Motherboard 
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Legacy Benchmarks 


Whetstone (inc. FPU):

                 MWIPS   MFLOPS   VAXMIPS   MWIPS-DP   Year
* Pentium 100    66.2    16.8     97.8      66.2       1994
* Pentium 120    79.5    20.2     118       81.6       1995
* Pentium 133    88.3    22.4     130       90.8       1995

Linpack: (FPU heavy)

                 Opt     No opt
* 80486 DX2 66   2.63    1.74
* Pentium 75     7.56    4.04
* Pentium 100    12.07   5.40
* Pentium 133    17.05   5.60
* Pentium 166    19.89   6.86

Dhrystone: (no FPU)

                 Dhry1 Opt   Dhry1 NoOpt   Dhry2 Opt   Dhry2 NoOpt
* 80486 DX2 66   45.1        12.0          35.3        12.4
* Pentium 75     112         19.3          87.1        18.9
* Pentium 100    169         31.8          122         32.2
* Pentium 133    239         38.3          181         39.0
* Pentium 166    270         43.6          189         43.9
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Reducing Transistor Variability For High 
Performance Low Power Chips 
HOT Chips 24 


Dr Robert Rogenmoser 
Senior Vice President Product Development & Engineering


Copyright © 2012 SuVolta, Inc. All rights reserved. 


Overview 


■ Transistor Variability Limits Chips
  - Impact on Mobile System on Chip (SOC)
  - Limited Low Power Design Techniques
  - Where does Variability come from?

■ New Transistor Alternatives to Reduce Variability
  - Deeply Depleted Channel (DDC) technology
  - Silicon Impact

■ Outlook
  - Taking advantage of Deeply Depleted Channel (DDC) in Mobile SOC
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What is needed in Mobile System on Chip? 


[Diagram: mobile SoC with multiple blocks, including a GPU]

■ Multiple blocks with different performance requirements
  - Integrated on the same die
  - Different power modes, would like to run at different supplies
  - Multiple VT transistors used to control leakage
■ Single chip solution requires analog integration
■ Need co-design of architecture, circuits and transistor technology for best solution
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Variability Limits Design & Architecture 


■ Limited benefit using voltage scaling (DVFS)
  - Cannot overdrive much due to reliability and power restrictions
  - Dynamically lowering voltage limited to 100-200mV
  - Only lowering frequency leaves large leakage power
  - “Run to hold” beats DVFS despite overhead

■ Finicky SRAM memories
  - High SRAM Vmin leaves no room for memory voltage scaling
  - Many circuit tricks to improve Vmin and noise margins
  - Design teams moved to dedicated power rail for SRAM
  - Works for CPU, difficult in GPU
  - Impacts power network integrity, more fluctuations

■ Transistor variability limits chips
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Transistor Variation Source of Chip Variation 


■ Global/Systematic/Manufacturing Variation
  - Shifts all the transistors similarly
  - Longer/shorter transistor lengths
  - More (or less) implant energy and dose
  > Will result in speed/power distribution

■ Local/Random Variation
  - Transistors next to each other vary widely
  - Small number of dopants in transistor channel
  > Random Dopant Fluctuation (RDF)
  - Apparent in threshold voltage mismatch (σVT)
  - Impacts speed, leakage, SRAM & Analog
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> Industry solution: Remove RDF using Undoped Channel 


» What is the right silicon roadmap going forward? 
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Transistor Alternatives 


= FinFET or TriGate 
= Promises high drive current 
= Manufacturing, cost, and IP challenge 
= Doped channel to enable multi V+ 


У 


Техіроок FinFET Intel TriGate 
= FDSOI


= Showing off undoped channel benefits
= Good body effect, but lack of multi VT capability
= Restricted supply chain


[Figure: FDSOI cross-section with 10nm Si film, 50 nm scale bar. Source: IMEC]


= DDC — Deeply Depleted Channel transistor 
= Straight forward insertion into Bulk Planar CMOS 
= Undoped channel to reduce random variability 
= Good body effect and multi VT transistors
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Deeply Depleted Channel™ (DDC) Transistor 
(1) Undoped or very lightly doped region


= Significantly reduced transistor random variability σVT
> Lower leakage 


> Better SRAM (lea, lower Vmin & Veer) 
> Tighter corners 


> Smaller area analog design 


= Higher channel mobility (increased I. lower DIBL) 
> Higher speed, improved voltage scaling 


(2) VT setting offset region

= Enables multiple threshold voltages
(3) Screening region

= Strong body coefficient 


> Bias bodies to tighten manufacturing distribution 
> Body biasing to compensate for temperature and aging 


"Example implementation 


Benefits similar to FinFET in planar bulk CMOS 
Lower Transistor Variability Reduces Leakage


[Figure: 65nm silicon SRAM VT distributions for σVT = 0.06V and σVT = 0.03V, with leakage power overlaid. The high-leakage (low VT) tail dominates power: 2.7X higher power for the wider distribution (model using 85mV subVT slope). The high VT tail slows down ICs.]


= Transistor variability is reflected in threshold voltage (VT) distribution
= Leakage current is exponentially dependent on VT

= Lower VT variability (σVT) reduces number of leaky low VT devices

= Power dissipation is dominated by low VT edge of distribution

= Smaller σVT > Less leakage power for digital and memory/SRAM
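The 2.7X figure quoted for the wider distribution can be reproduced with a simple closed-form model. The following is a minimal sketch, assuming VT is normally distributed and that subthreshold leakage drops by one decade per 85mV of VT (the slope named on the slide). The Gaussian assumption and the function name are illustrative and are not taken from the presentation.

```python
import math

def mean_leakage_factor(sigma_vt, slope_v_per_decade=0.085):
    """Average leakage of a Gaussian VT population, relative to a
    population with the same mean VT and zero spread.

    Leakage is modeled as 10 ** (-VT / slope); averaging that over a
    normal distribution gives exp((a * sigma) ** 2 / 2), a = ln(10) / slope.
    """
    a = math.log(10) / slope_v_per_decade
    return math.exp((a * sigma_vt) ** 2 / 2)

wide = mean_leakage_factor(0.06)   # sigma VT = 0.06V
tight = mean_leakage_factor(0.03)  # sigma VT = 0.03V
ratio = wide / tight
print(f"wide/tight leakage ratio: {ratio:.2f}")
```

Under these assumptions the ratio comes out close to 2.7, which suggests the slide's number follows from the exponential VT dependence alone.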
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Lower Transistor Variability Improves Speed 


[Figure: 65nm silicon measurement. Ring-oscillator delay distributions (A), (B), (C); delay axis in ps, 450 to 500 shown]


= Nominal (TT) ring oscillator speed expected to be 400ps (A)

= Equivalent to having many similar critical paths in a chip

= VT variation will randomly affect paths within the same die, limiting speed to 470ps
= Undoped channel reduces variability and increases mobility (B)

= 25% faster mean, 30% faster tail due to tighter distribution
= To match performance, lower VDD until tails have same speed (C)

= Large impact on power due to square dependence: P = CV²f + IV


— 
9 | HotChips 2012 | Copyright © 2012 SuVolta, Inc. All rights reserved. SUVOLTA’ 


Lower Variability Improves Transistor Matching 


= SRAM memories built using 6-T SRAM cell


= Smallest transistors on every chip,
worst VT mismatch


= Higher VDD is required to avoid failures


[Figure: SRAM cell node voltages, Node 2 [V] axis; (a) DDC]


= Demonstrated SRAM to Vmin of 0.425V


= In analog circuits, matching is key 

= Large transistors used to improve relative 
variability in current mirrors, differential pairs, etc. 

= Better transistor matching allows for 
= Area savings 
= Higher performance 
= Lower power 

= Undoped channel improves 
Rout > higher gain 
[Figure axis: Node 1 [V]. 65nm silicon measurement]
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Better Chips with Body Biasing 


Too slow 


Too hot 


= Body bias to fix systematic variation
= Speed up (forward body bias, FBB) slow parts
= Cool down (reverse body bias, RBB) hot parts
> Increase manufacturing yield


= Body bias enables multiple modes of operation
= Active > minimize power at every performance level
= Standby > leakage reduction, power gating


= DDC provides 2-4x larger body factor


[Figure: body coefficient [mV/V] at 65nm, 40nm, 28nm, 20nm and 14nm for DDC, conventional halo bulk planar, and FinFET; includes TCAD predictions]
Half the Power at Matched Performance


[Figure: 65nm silicon measurement. Power [uW] vs. Frequency [MHz] for baseline and DDC with body bias at process corners (FF, TT); baseline speed and the 100% power level are marked]


= Inverter ring-oscillators (RO) fabricated at process corners
= Baseline @ 1.2V VDD and DDC @ 0.9V VDD


= For each corner, DDC RO is faster and lower power


= Using strong body coefficient to pull in corners
= Half the power (50% less power) while matching speed
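As a rough cross-check of the supply voltages quoted above, the dynamic part of the power scales with the square of the supply at a fixed frequency. This is a minimal sketch that considers only the CV²f term; the function name is illustrative, and leakage and body-bias effects, which the measured 50% figure presumably also reflects, are left out.

```python
def dynamic_power_ratio(v_new, v_old, f_new=1.0, f_old=1.0):
    """Ratio of dynamic power (C * V^2 * f) between two operating
    points, assuming the switched capacitance C is unchanged."""
    return (v_new / v_old) ** 2 * (f_new / f_old)

# DDC at 0.9V versus baseline at 1.2V, matched frequency
ratio = dynamic_power_ratio(0.9, 1.2)
saving_pct = (1.0 - ratio) * 100.0
print(f"dynamic power ratio: {ratio:.4f} ({saving_pct:.2f}% lower)")
```

The supply reduction alone accounts for roughly 44% lower dynamic power under this model, in the same range as the measured result.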


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Tighter Manufacturing Corners w/ DDC 


Better process control 
leads to tighter corners 


" Manufacturing flow further 
reduces layout effects 


1 sigma tighter wafer to 
wafer and within wafer 
variation for DDC 


= Less overdesign as max 
paths and min (hold) 
paths are closer 


Faster design closure 
> earlier tapeout 
> shorter TTM

[Figure: 65nm silicon measurement]
Voltage Scaling to 0.6V VDD


[Figure: 65nm silicon measurement. Power (W), up to 5.00E-04, with baseline FF and TT corners marked]


= Achieve half the speed at 1/6 the power @ 0.6V VDD


= Use body bias to compensate for temperature and aging
= Critical for low VDD operation


= Enable workable design window, avoid overdesign
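The "half the speed at 1/6 the power" point can also be read as energy per operation, since energy is power multiplied by the time each operation takes. A minimal sketch using exact fractions follows; the function name is illustrative.

```python
from fractions import Fraction

def energy_per_op_ratio(power_ratio, speed_ratio):
    """Energy per operation relative to baseline.

    Energy = power * time, and time per operation scales as 1 / speed.
    """
    return power_ratio / speed_ratio

# 0.6V operating point from the slide: 1/6 the power at 1/2 the speed
e_ratio = energy_per_op_ratio(Fraction(1, 6), Fraction(1, 2))
print(e_ratio)
```

With the slide's two ratios, each operation at 0.6V costs one third of the baseline energy.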


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This is HotChips — Go Faster! 


[Figure: 65nm silicon measurement. Power (W), 0 to 5.00E-04, vs. Frequency (MHz), 300 to 600, for baseline and DDC at SS, TT and FF corners; +56% speedup marked]


= Turbo Mode: DDC achieves over 50% speedup @ 1.2V VDD
= All corners for DDC run at 580MHz vs 370MHz for baseline 
28nm and Beyond


[Figure: silicon-calibrated SPICE simulations plotted against VDD [V]. Curves: Baseline: SS, VBS = 0V; DDC: SS, VBS = 0V; DDC: SS, VBS = +0.3V]


= Same performance at 0.75V VDD as baseline at 0.9V VDD
= 30% lower power 


= Alternatively 25% faster at same voltage 


= Even better when using body bias to pull in corners 
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Applying DDC to Lower Variability in Mobile SOC 


[Figure: mobile SoC block diagram; legible labels include GPU, BT, WiFi, GPS, media, and 1.5GHz]


= CPU: Single-thread performance critical
= Push frequency by temporarily raising voltage in turbo mode 
= DVFS with body biasing becomes DVBFS 


= GPU: High number of cores using small transistors 
" Less overdesign due to lower delay variability 
= Increase parallelism, lower voltage, body bias dynamically for more pixels/Watt 


= Lower frequency blocks 
= In addition to high VT transistors also run at lower voltage and optimal body bias


= Whole chip: Use body bias to adjust for manufacturing variation 
= Take advantage of improved memory and analog performance 


* Lowering variability while compatible with existing bulk planar silicon IP 
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Conclusions 


= Variability limits chips 
= DDC reduces random variability through its undoped channel 


= DDC's strong body factor can be used to fix systematic variation and 
compensate for temperature variation 


= DDC provides performance kicker from 90nm to 20nm 
= Straight forward integration into existing nodes 
= Compatible with existing bulk planar CMOS silicon IP 
= Use existing CAD flow 


= DDC brings back low power tools 
= Large range DVFS 
= Body biasing 
= Low voltage operation 


= Taking advantage of reduced variability DDC in design and 
architecture will lead to next level in mobile SOC 
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High performance and efficient single-chip 
small cell base station SoC 


Kin-Yip Liu 
Cavium, Inc. 
kliu@cavium.com 
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Presentation Overview 2 CAVIUM 


Base station processing overview 


Why small cells and heterogeneous Radio Access 
Network (RAN) 


Small cell design based on OCTEON Fusion 
OCTEON Fusion CNF71XX architecture 
CNF71XX design 

Software models 

Summary 
LTE Wireless Network Overview


= LTE equipment:
* Base stations: eNodeB
* User equipment (UE), e.g. cell phone, dongle for notebook PC
* Core network: Evolved Packet Core (ePC)

= An eNodeB interfaces with:
* ePC (multiple nodes with different functions)
  * Control, signaling
  * To voice & data networks
* UEs
* Neighbor eNodeBs
  * Communicate load and interference info
  * Handover UEs


[Figure: evolved packet core (ePC), base station (eNodeB), neighbor eNodeB, and user equipment (UE)]
LTE Protocols & Processing


= eNodeB relays information between UE and ePC
= eNodeB and UE communication protocol:


Protocol layers | Processing functions
RRC (layer 3) | Set up and maintain radio bearers. Manage radio resources. Control functions. Handover decisions
PDCP (layer 2) | En/decrypt over-the-air traffic. Header de/compression
RLC (layer 2) | Segment and reassemble traffic. Ensure in-order traffic delivery. Re-transmit as needed
MAC (layer 2) | Schedule use of over-the-air resources. Select PHY configuration for transfers. Collect stats & report to RRC
PHY (layer 1) | Physical layer: OFDM for downlink. SC-FDMA for uplink


= eNodeB and ePC communication protocol:


* IP network, IPSec protected, GTP tunnels of user data in UDP/IP,
SCTP for control traffic
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Classes of Base Stations 2) CAVIUM 


Home Enterprise
Femto Femto 
Cell Radius 250 -400m 2 - 20km 20km 


No. of users 8 32 128 1200 3600 


Peak data rate 50MbpsDL 100Mbps DL 150Mbps DL 300Mbps DL 900Mbps DL 
25Mbps UL 50Mbps UL 75Mbps UL 150Mbps UL 450Mbps UL 


User Mobility 4 km/hr 4 km/hr 50 km/hr 350 km/hr 350 km/hr 

Locations Home Office, school, Urban Urban, rural Metro, 
apartment hotspots, areas traditional 
buildings, rural areas approach 
malls 


DL - Downlink. Traffic going from network to user 
UL — Uplink. Traffic going from user to network 
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Additional Small Cell Requirements 2 cavium 


= WiFi option 
— Single platform for Small Cell + Access Point 
— SoC must provide performance headroom for both functions 
= Power-over-Ethernet 
— Simplify system deployment, but limited system power supply 
— SoC must consume very low power 
= Time synchronization 
— Mandatory for LTE base stations. IP backhaul, no TDM interface 
— GPS option. May not work well in-door 
— Software solutions: IEEE 1588 v2, NTP. In-door OK, cost effective 
= Security 
— Authenticated and encrypted software for secure boot 
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Why deploy small cells? = слалом 


ети for Hot spots апа Not spots 


Easing congestion New coverage 
Within macro coverage in addition to macro 


Small Cells essential for LTE 


coverage, capacity, and throughput 
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Current Generation Base Stations @cavium 


PHY (layer 1) | MAC, RLC, PDCP, RRC,
Communicate w/ core network


Macro 
OCTEON Multicore SoC 


DSP | | OCTEON Multicore SoC 


single-chip Multicore SoC for Layer 2 and above 


Small Cells 


processing. Common software from Small to Macro cells 
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Next Generation Base Stations Z CAVIUM 


PHY (layer 1) | MAC, RLC, PDCP, RRC,
Communicate w/ core network


OCTEON Multicore SoC 


Small Cells 


single-chip Multicore + baseband module SoC for Small 
Cells. Common software from Small to Macro cells 
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OCTEON Fusion based Small cell Z самом 


Dual band 802.11n/ ac 


IEEE 1588 v2, SyncE 


Backhaul GbE 


Management 


OCTEON 
Fusion 


ЖОШ | CNF71XX 


Small Cell Base Station + Access Point 
OCTEON Fusion CNF71XX
Small cell BaseStation-on-a-chip Family


= High Performance LTE / 3G Small Cell SoC Processors:
- 4 MIPS64 cores up to 1.5 GHz
- 6 DSP cores up to 500MHz
- Many HW Accelerators for Packet Processing, LTE/3G, and Security
- IEEE 1588 v2, SyncE
- Authentik secure boot

= Highly Scalable
- Spanning 32 to 200+ Users
- 3G and LTE FDD & TDD
- Up to LTE 20MHz 150 Mbps Uplink (UL) + 150Mbps Downlink (DL)

= Headroom for Unique Carrier Class Features
- Self Optimizing Networks
- Interference Cancellation
- Advanced Receivers


[Figure: CNF71XX block diagram. Legible blocks: 64-bit + ECC DDR3, secure boot, USB 2.0, MIPS64 CPU cores with crypto and packet security, 2x GigE SGMII (1588v2, SyncE), 2x PCIe, packet input & output, packet order/QoS/scheduling, timers, shared memory, acceleration blocks, JESD 207P RFIC interface (LTE TDD/FDD, WCDMA, 2x2 MIMO, up to 20 MHz)]
Design Philosophy


* Power and area efficient CPU and DSP cores
* Scale performance with more cores
* Not depend on very high frequency or core complexity


Short Latencies, Deterministic Performance
* Shortest cache and memory latencies. Optimize for determinism
* Flexible prefetch, cache hints, options to cache packet headers only
* L2 way partition feature avoids cache pollution


Optimized ISA, Ease of programming
* MIPS64 r3 instruction set + >80 OCTEON instructions
* Full C programming. Standard OS and development tools


Comprehensive Hardware Acceleration
* TCP/IP, complete packet receive and transmit offload, packet ordering, QoS, work scheduling, buffer de/allocation, IPSec, wireless crypto algorithms, timers, wireless baseband functions
* Crypto coprocessor in each core. Best latency & determinism


* Software compatible from 1-48 cores and across generations
* Single SDK to develop software for all OCTEONs
* Software for macro base stations directly reusable for Small Cells
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Baseband Module 2) CAVIUM 


Baseband module processing flows 


e Wireless UL and DL processing differ. Partition the DSP cores and assign 
relevant hardware accelerators for UL Vs. DL processing 
* Modular design with flexible partitioning simplifies software design 


6x DSP cores optimized for wireless baseband processing 


e 3-way VLIW, with 16x MAC or 4x complex MAC vector processing per cycle 
* Optimizing instructions for wireless baseband processing 

* Dual 128-bit load/store paths transfer up to two vector operands each cycle 
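The per-cycle figures above translate into peak rates once a clock is fixed. The following is a back-of-the-envelope sketch, assuming all six DSP cores run at the 500MHz maximum quoted elsewhere in the deck and issue a full vector MAC every cycle; these are upper bounds, not measured throughput, and the variable names are illustrative.

```python
DSP_CORES = 6                  # "6x DSP cores"
MACS_PER_CYCLE = 16            # "16x MAC ... per cycle"
CLOCK_HZ = 500_000_000         # "up to 500MHz" (assumed sustained)
LOAD_STORE_PATHS = 2           # "Dual 128-bit load/store paths"
PATH_WIDTH_BITS = 128

# Peak real-valued MAC rate across the baseband module
peak_macs_per_s = DSP_CORES * MACS_PER_CYCLE * CLOCK_HZ

# Peak operand bandwidth per core over both load/store paths
bytes_per_cycle = LOAD_STORE_PATHS * PATH_WIDTH_BITS // 8
peak_bytes_per_s_per_core = bytes_per_cycle * CLOCK_HZ

print(peak_macs_per_s / 1e9, "GMAC/s peak")
print(peak_bytes_per_s_per_core / 1e9, "GB/s per core peak")
```

The two load/store paths supply 32 bytes per cycle, which is what lets a core fetch two 128-bit vector operands while issuing a vector operation in the same cycle.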
Hardware accelerators (HABs) 


* Comprehensive set of LTE and 3G, UL and DL relevant accelerators 
e Automate offload to accelerators with DMA engines and Sequencer 


Shared memory interconnect 


e DSPs and HABs can access any memory structure in entire baseband module 
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А Cluster of the Baseband Module 2слушм 


— sm s — 


HAB 
Memory 
ИДЕ <<———> Manager 


(ОМА 
engines) 


Control path 


Interrupts 


128-bit dual load/store paths enable VLIW DSP cores to fetch 


two 128-bit vector operands + processing in single cycle 
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CNF71XX Baseband Architecture zcavium 


ҮШ = 
“Оп саа | Interconnect 
nterface, DMA 
then L2/DRAM | timers, reset interface 
control, etc. JESD 207P 
<--------------------” 
Inter-cluster links enable 
<---- 


DSP cores and HABs to
access memory in other 
clusters 


Example processing model 
and flow of wireless data 


Shared memory interconnect enables flexibility in 


optimizing the processing models and flows 
OCTEON Multicore


Wireless L2 & L3, Transport, Control, WiFi, Customer Apps 


* OCTEON Fusion = OCTEON Multicore + Baseband module


* The OCTEON Multicore part of the SoC is the same architecture as OCTEON 
Multicore SoCs which have been widely deployed for designing base stations 


CPU cores 


* 4x OCTEON MIPS64 cores 

* Shortest L1 and last-level-cache (L2) latencies among multicore processors 
* Power optimizer™ per-core software controlled power reduction 

* Fine-grained clock gating 


Hardware accelerators 


* Comprehensive packet processing hardware: Headers parsing, classification, 
RED, QoS, buffer allocation, L4 checksums, traffic rate limiting & scheduling 
* Crypto, packet order, work scheduling, timers for TCP and RLC, RoHC 


Low latency interconnect 


• Split-transaction interconnects and L2 cache run at core frequency 
OCTEON enhanced MIPS64 core


Custom designed efficient 64-bit CPU core 


* Dual-issue, 8+ stages. Optimized for perf/watt, perf/area
* Short 3-cycle L1 cache load-to-use latency
* MIPS64 r3 instruction set + >80 optimizing instructions


Examples of optimizing instructions added


* Atomic memory ops (increment, add, fetch-and-add, etc.)
* Insert/extract arbitrary bit fields within a word

* Branch if certain bit field contains a set bit or not

* Compare operands and set bit 0 for equal / not equal

* Additional flavors of prefetch and cache hints 

* Population count 

* Unaligned load/store 
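To make the intent of the bit-field and population-count instructions concrete, here is a minimal sketch of the equivalent operations in software. The helper names are hypothetical and do not correspond to OCTEON mnemonics or encodings; the point is that each helper takes several shifts and masks, which a single dedicated instruction can replace in packet-header parsing.

```python
def extract_bits(word, pos, length):
    """Return the `length`-bit field of `word` starting at bit `pos`."""
    return (word >> pos) & ((1 << length) - 1)

def insert_bits(word, pos, length, value):
    """Return `word` with the field at `pos` replaced by `value`."""
    mask = ((1 << length) - 1) << pos
    return (word & ~mask) | ((value << pos) & mask)

def popcount(word):
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")

def bit_is_set(word, bit):
    """Condition a 'branch if bit set' instruction would test."""
    return bool((word >> bit) & 1)

# First 32 bits of an IPv4 header: version 4, IHL 5, total length 0x54
hdr = 0x45000054
version = extract_bits(hdr, 28, 4)
ihl = extract_bits(hdr, 24, 4)
patched = insert_bits(hdr, 0, 16, 0x0100)  # rewrite the length field
print(version, ihl, hex(patched), popcount(hdr), bit_is_set(hdr, 30))
```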
OCTEON Cache Policies


L1 <-> L2 Cache: Write-through 


* Excellent performance for networking and wireless 
applications 


* Minimal per-CPU-core cost (power, area) 

* Lowest possible read latencies 

• Allows many outstanding stores, optimizations 
e Automatic L1 error correction 


L2 Cache <-> DRAM: Write-back


e Standard DDR3 DRAM DIMM's are highest performance 
with block transfers 

* Minimizes required DRAM bandwidth 

* Don't-write-back feature (e.g. for most of packet data) plus 
additional cache hints 
CNF71XX Coherent Interconnect


_ MIPS64 - 
Core 3 


© MIPS64 - 
Core 2 


| MIPS64 
Core 1 


| MIPS64 
Core 0 


64-bit CPU cores, split-transaction interconnect, 
L2 cache & controller all run at core frequency 
CNF71XX Chip Floorplan


о Baseband module: 


6x DSP cores 
HW accelerators 
Memory structures 


Shared memory 
interconnect 


RFIC interface 
Timers 
Interrupts & control 


4x MIPS64 Cores 
1MB L2 cache and coherent
memory interconnects
Packet/Data Flow:
LTE Downlink (DL) Processing


Communication between eNodeB and ePC:
1. ePC sends user packets to eNodeB over GTP-U tunnels. Packets arrive via GE port
2. Packet Input hardware handles all Ethernet packet receive, parsing headers to identify flow for packet order and QoS, allocating buffers, and DMAing packet data to buffers in L2 cache/memory
3. MIPS64 cores and hardware accelerators terminate the packet data, including IPSec decrypt


Communication between eNodeB and UEs with 1ms TTI (transmission time interval):
1. MIPS64 cores and accelerators process PDCP, RLC and MAC protocol layers
2. MAC layer processing schedules data and wireless PHY configuration for DL transmission
3. Baseband hardware DMAs data from L2 cache to its local memory
4. Downlink DSP cores and HABs complete DL processing and transmit data out via RF interface


[Figure: DL data flow through packet input, buffer manager, MIPS64 cores, packet order/QoS/scheduling, interconnect, bridge, packet output, and baseband module (DSPs and HABs)]
Packet/Data Flow:
LTE Uplink (UL) Processing


Communication between eNodeB and UEs with 1ms TTI (transmission time interval):
1. PHY baseband processes UL traffic and detects random access from UEs
2. PHY baseband DMAs processed UL data to L2 cache
3. MIPS64 cores and accelerators process MAC, RLC, and PDCP layers to terminate received UE traffic into packets


Communication between eNodeB and ePC:
1. MIPS64 cores and hardware accelerators package received UE data into IP packets
2. Encrypt the IP packets using IPSec
3. Send the packets to ePC via GTP-U tunnels and via GE port


[Figure: UL data flow through baseband module (DSPs and HABs), interconnect, bridge, MIPS64 CPU core, packet order/QoS/scheduling, packet input, and packet output]
Mapping eNodeB to Multicore


Example partitioning: LTE eNodeB AP
* MAC and L1 driver on one core
  * Easy to meet LTE 1ms TTI
  * Quick response to PHY interrupts
* RLC, PDCP, Transport on one core
  * Option to partition L2 cache to avoid cache pollution from control processing
* Control processing on one core
* 1 core free
  * Headroom for WiFi and service provider applications
* Small Cell Forum API compliant


[Figure: OCTEON MIPS64 cores. Core 0: Control (OAM, S1-AP, X2-AP, RRC, IEEE 1588). Core 1: Data/Packet (RLC, PDCP, GTP-U, IPSec). Core 2: MAC, L1 driver. Core 3: for customer apps and/or WiFi]


Quad-core delivers required headroom and deterministic
performance for real-time LTE and other processing
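The partitioning above can be written down as a small table, together with the per-TTI cycle budget that the MAC core has to live within. This is an illustrative sketch only: the dictionary layout and names are hypothetical, and the budget assumes the 1.5 GHz maximum core clock quoted for the CNF71XX.

```python
CORE_HZ = 1_500_000_000   # MIPS64 core clock, "up to 1.5 GHz" (assumed)
TTI_US = 1_000            # LTE transmission time interval: 1 ms

# Example role assignment per core, as listed on the slide
core_roles = {
    0: ["OAM", "S1-AP", "X2-AP", "RRC", "IEEE 1588"],   # control
    1: ["RLC", "PDCP", "GTP-U", "IPSec"],               # data/packet
    2: ["MAC", "L1 driver"],                            # real-time, TTI-bound
    3: ["customer apps", "WiFi"],                       # headroom
}

# Cycles one core can spend inside a single TTI (integer arithmetic)
cycles_per_tti = CORE_HZ * TTI_US // 1_000_000

realtime_core = next(c for c, roles in core_roles.items() if "MAC" in roles)
print(f"core {realtime_core} has {cycles_per_tti:,} cycles per 1 ms TTI")
```

Keeping MAC and the L1 driver alone on one core means that whole budget is available for scheduling and PHY interrupt handling, which is the determinism argument the slide makes.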
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CNF71xx Complete End-to-end Validation Z cavium 


› 
› 


| Platform ready | STEP4 
^ 


STEP1 - PHY + Driver S/W 
STEP2 - PHY + Driver S/W 
STEP3 - L1 + L2 + L3 


+ PLT (Physical Layer Test) 
* Scheduler 


STEP4 - PHY + Modem + Radio 

STEP5 - Core network + Basestation (L2/L3 stacks, S1 I/F) 
STEP6 - IOT (Interoperability Testing) in PHY (PLT + Modem + Radio + UE L1) 
STEP7 - IOT in MAC (w/ UE L1/L2) 


STEP8 - IOT in E2E (w/ UE over full protocol stacks)


STEP9 - DL/UL Performance Measurements w/ UE 


STEP1 PHY Verification 
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-E 


Driver Verification 


STEP3 


Scheduler/L2 Verification 


adio Integration 


PHY IOT 


STEP6 


STEP5 


w/ core network, w/o radio) 


Hot Chips 24 


Performance 


STEP9 


STEP7 


MAC IOT 


End-to-End 
STEP8 
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Summary Z CAVIUM 


> OCTEON Fusion CNF71XX


= High performance "base station on a chip" SoC
* LTE 20MHz, 150Mbps DL + 150Mbps UL, 2x2 MIMO, 128 users


= OCTEON Fusion = OCTEON multicore + baseband


* Same OCTEON software for small to macro cells
= End-to-end interoperability and performance verified


> Optimized for Base station designs


= Delivers deterministic real-time performance, low power,
and high integration, with significant compute headroom
* 4x enhanced & efficient 64-bit (OCTEON MIPS) CPU cores
* 6x Baseband optimized DSP vector processors
* Many hardware accelerators
* Optimized for short latencies and deterministic performance
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Backup 2 CAVIUM 
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Cavium: Company Summary Z CAVIUM 


в Founded 2001 
< = NASDAQ ІРО (CAVM) 2007 


pow Locations: US, India, China, TW 


= 2011 Revenues : $259M, +26% YOY 
| = 5 year CAGR: ^50% 


" Profitable with Strong Financials, Zero Debt 


= Addressing Multi-billion dollar Networking, Communications, Storage and 
Digital Home markets. 


" MIPS64 and ARM based Multi-core Processor SoCs; Multi-core Search and 
Security Processors 


= All Top Networking, Wireless and Security Vendors use Cavium 
Carriers coping with 1000x traffic increase and no extra revenue


Smart devices multiply traffic
= Smartphone: x 24*
= Handheld gaming console: x 60*
= Tablet: x 122*
= Laptop: x 515*

* Monthly basic mobile phone data traffic

[Figure: revenue per subscriber vs. network cost per subscriber; the device list also shows a mobile phone projector]

Source: Cisco VNI Mobile, 2011


Heterogeneous Radio Access Network
= Macro base stations are expensive (CAPEX and OPEX)


= Augment Macro with Small cell base stations to add capacity and 
coverage cost effectively 
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Previous Generation Base Stations 2) CAVIUM 


PHY  L2(MAC, | L3 IPv4/v6, 
| ВЕС) | GTP, PDCP 


Macro x x Control 
x x CPU 


Micro/Pico 


Before Multi-core SoCs became available, Base Station designs required 


many components, microcode programming on NPU, general purpose 
_ High 
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QUALCOMM 


FSM™ 
Femtocell Station Modem 


A Highly Integrated, Performance Driven 
Chipset for the Small Cell Market 


Luca Blessent 


[#ualcomm 


Notices кдй 


Copyright © 2012 QUALCOMM Incorporated. 
All rights reserved. 


QUALCOMM is a registered trademark of QUALCOMM Incorporated in the United States and may be 
registered in other countries. Other product and brand names may be trademarks or registered 
trademarks of their respective owners. 


Qualcomm reserves the right to make changes to the product(s) or information contained herein without 
notice. No liability is assumed for any damages arising directly or indirectly by their use or application. 
The information provided in this document is provided on an "as is" basis.
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Outline я 


" Small Cells: Motivation and Implications 
= Cellular Access Point Evolution 

= The FSM9xxx Chipset

" Design Challenges 

= Selected Advanced Features 

= FSM9xxx Based Access Point 

= Power Consumption 

" Summary and Closing Remarks 
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@ualcomm 
ono ATHEROS 
Traditional Cellular + Small Cells New Cellular 
.. Coverage Model . Topology 


Limited Spectrum 
Improved User 
Experience 
Interference 


Management 


Low Cost 


Neighborhood 
Femtocells 
Low Power New Requirements for 
Cellular Access Points 


FSM9xxx a 


Quarconw 


Ultra Compact 


Small Cell Access Point Advanced 
Features 
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Cellular Access Point Evolution ATHEROS 
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ADVANCED 
FEATURES 


COST & SIZE 


MVM. 
1 4. А. 


FSM9xxx `` 


Qualcomnx 


[#ualcomm 
ATHEROS 


The FSM9xxx SoC


Key Stats
= 45 nm
= ~1.8 W for realistic full load
= Sampling commercially since April 2011


Key Features
= Small Cell Modem
= Integrated GPS
= Snapdragon™ Application Processor
= Security provisions
= Interference management


[Figure: chip layout]
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feuaicomm 


The FSM9xxx Architecture 


Hexagon™ Snapdragon™
Processor Processor 


| 


Interconnect Fabric 


Secure 
Boot 
= Processor 


Processors 


Snapdragon™ 
Processor 


Qualcomm's 1st generation CPU,
codenamed "Scorpion"


1 GHz

ARMv7 ISA

~1.6x DMIPS/MHz w.r.t. ARM11
Optimized for low power 

Open processor 

Handles L3, OA&M, etc. 
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feuaccomm 
ATHEROS 


Hexagon™
Processor 


Qualcomm's custom DSP 

600 MHz 

Multi-threaded 

Closed processor 

Handles L1 hardware control and 
L2 


Design Challenges ао 


= Need to combine base station and mobile functionality 
Downlink processing for neighbor discovery and self-configuration 
" Aggressive power consumption target 
< 5W for full solution 
= Stringent security requirements for residential deployment 
Requires on-chip trusted execution environment 
= Uncompromised modem performance 
Up to 16 Multi-RAB UMTS users 
28 Mb/s downlink throughput 
5.7 Mb/s uplink throughput 
Rx and Tx diversity


= Support for advanced interference management features 


Additional processing chains for beaconing and uplink 
measurements 
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Additional processing chains: 

1) Downlink beaconing to facilitate 
system reselection 

2) Uplink mobile and interference 

sensing 


Advanced 
Signal Processing 


Simultaneous small cell service and 
downlink sniffing: 

1) Dynamic interference management 

2) Continuous VCTCXO disciplining 


DFE_AX2 


5x5 MUX 


ШЕН ИЕ 
4x5 MUX 


ОҒЕ RX3 


DFF RX4 


Small Cell Downlink 
Modem Heceiver 


"NU 


4x4 MUX 
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Security келат 


Snapdragon™ 
Processor 


| 


| 


Interconnact Fabric 


Secure Region for 

Trusted Execution Secure 

Environment Ом Boot 
Processor 


Secure Boot Procedure 
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[#ualcomm 
ATHEROS 


The FSM9xxx Chipset 


FTR8700 Transceiver 


2x2 wideband (25 
MHz) chains 


Global UMTS and 
FSM9xxx Baseband CDMA2000 bands 


Processor FSM9xxx 


RTR8605 Receiver 


О FSM92xx SKUs for UMTS 
Ц FSM98xx SKUs for 
CDMA2000 


Downlink receiver 
GPS receiver 


QUALCOMM 


iue ' Power Management IC 


Ц Voltage regulators 
Ч System clocks 
FSM9xxx Based AP: Functional View


[Figure: functional view of an FSM92xx-based access point. Legible components: Ethernet PHY, antenna, two DDR2 SDRAM 1Gb x 16 (128 MB each), NAND Flash 2Gb x 8 (256 MB), AC/DC wall adapter, VCTCXO 19.2 MHz]
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FSM9xxx Based АР: Implementation клас: 


2.5 in. х 2.5 in., 6-layer Board 
Small Cell Access Point Implementation 


5 
I 


“тет! 
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ATHEROS 


Power Consumption


Test Configuration:
* 8 user residential femtocell (FSM9208)
* HSDPA + EUL operation
* 1.9 GHz band
* 13 dBm maximum Tx power
* Single Tx/Rx
* GPS and downlink receiver active
* Measurements at room temperature


[Figure: AP power breakdown. Legible slices include Ethernet, FTR8700 and FSM9208, with values 6.8%, 12.2% and 38.0%]


Total AP Power: 4.8 W
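The breakdown can be tied back to the SoC power quoted earlier in the deck (~1.8 W for realistic full load). A minimal sketch follows, assuming the 38.0% slice belongs to the FSM9208 baseband SoC; that pairing is inferred from the position of the labels in the extracted figure and is not stated in text.

```python
TOTAL_AP_POWER_W = 4.8      # "Total AP Power: 4.8 W"
FSM9208_SHARE = 0.380       # assumed to be the FSM9208 slice

def slice_power(total_w, share):
    """Power drawn by one component given its share of the total."""
    return total_w * share

fsm_power_w = slice_power(TOTAL_AP_POWER_W, FSM9208_SHARE)
print(f"FSM9208: {fsm_power_w:.2f} W of {TOTAL_AP_POWER_W} W")
```

Under that assumption the SoC draws about 1.8 W, consistent with the key stats slide, and the total stays under the 5 W target listed among the design challenges.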
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Summary and Closing Remarks 


= Data demand, capacity limits and economics are driving 
operators towards small cells 


= Small cells deployment models create new opportunities 
and introduce new design challenges 


= The FSM SoC provides a set of advanced features for 
Improved system performance 


= This SoC enables a very compact, low power small cell AP 
design 


= The FSM9xxx chipset is Qualcomm's 1st generation small
cell solution, focused on 3G


= This chipset is part of a portfolio of solutions that will 
include LTE, integrated Wi-Fi, and small cells evolution 
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_ QUALCOMM 


Thank You! 


Rumi Zahir 
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Outline 


> Low power platform progression 
> Medfield platform for the Smartphone form-factor 
* Constraints, Ingredients, Package 
> Penwell SOC 
* Block Diagram 
* Intel Atom™ CPU power management 
e SOC power management 
* Power management software architecture 
> Medfield reference platform 


> Smartphone roadmap


Mobile and Communications Group ( intel) 


Low Power Platform Progression


                 Moorestown (45nm)    Medfield (32nm)
Board size       5,000mm2             4,150mm2 (↓ 17%)
Standby power    21mW                 14mW (↓ 33%)
Browsing power   1.2W                 0.85W (↓ 29%)
Video            720p encode          1080p encode
Camera           5 mega-pixel         up to 16 mega-pixel
Graphics         800 MPPS             2,000 MPPS (↑ 250%)
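The percentages in the comparison can be recomputed from the raw values. This is a small sketch with illustrative helper names; note that the graphics entry reads most naturally as "250% of the Moorestown rate" (a 2.5x ratio), which is an interpretation rather than something the slide spells out.

```python
def percent_reduction(old, new):
    """Reduction from old to new, as a percentage of old."""
    return (old - new) / old * 100.0

board = percent_reduction(5000, 4150)     # mm^2
standby = percent_reduction(21, 14)       # mW
browsing = percent_reduction(1.2, 0.85)   # W
graphics_ratio = 2000 / 800               # MPPS, Medfield / Moorestown

print(f"board {board:.1f}%  standby {standby:.1f}%  browsing {browsing:.1f}%")
print(f"graphics {graphics_ratio:.2f}x")
```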


Mobile and Communications Group (intel) 


Crypto Audio 


Storage cpy 


Graphics ` Video 


Display 


Intel Form Factor Board Design 
Reference Design 


System-on-A-Chip 
Integration 


Design to meet Smartphone cost/power/pertormance requirements 


WIFI/BT Headset | 
Antenna Jack 


_ Diversity ` 
Antenna 


"ADM ET 
_ Connector. SIM 


Connector. 


Connector 


WIFI 


Power 
Module 


eMMC Delivery. 


Penwell қ == 
1 Main Antenna 
SOG мы ea 


coy 


Connector 


Penwell SoC Package Size 


Package-on-Package (POP) 
* 12x12 mm PoP FCMB4 (32nm) 
* Non PoP SoC < 0.8 mm 
* PoP z height < 1.4mm 
* OEM/ODM can solder up to 2 GB of LPDDR2 memory on top of SOC 


* Memory Peak Bandwidth 
  ✓ 6.4GB/s @ 800MT/s 
* Channels and ranks 
  ✓ Dual 32 bit channels 
  ✓ Supports 1 or 2 ranks per channel 
* Memory Size and Density 
  ✓ Supports total memory size of 128MB, 256MB, 512MB and 1GB per channel 
  ✓ Supports 1Gb, 2Gb and 4Gb chip densities 
* Other Features 
  ✓ Aggressive power management to reduce power consumption 
  ✓ [illegible] closing policies to close [illegible] 
  ✓ Supports different physical mappings of bank addresses to optimize for performance 
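
The 6.4GB/s peak figure is consistent with the dual 32 bit channels running at 800MT/s. A minimal sketch of that arithmetic follows; the function name and parameters are illustrative and do not come from the slides.

```python
def peak_bandwidth_gb_per_s(channels: int, bus_width_bits: int,
                            mega_transfers_per_s: float) -> float:
    """Peak memory bandwidth in GB/s (decimal, 1 GB = 1e9 bytes)."""
    # Bytes moved across all channels on every transfer.
    bytes_per_transfer = channels * bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Two 32-bit LPDDR2 channels at 800 MT/s, as listed on the slide.
penwell_peak = peak_bandwidth_gb_per_s(channels=2, bus_width_bits=32,
                                       mega_transfers_per_s=800)
print(f"{penwell_peak:.1f} GB/s")
```

This is a theoretical peak; sustained bandwidth depends on access patterns and the page policies mentioned under Other Features.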




[Figure: Penwell SOC block diagram. Labels: Security Engine, Power Manager, Low Power Audio, Storage, MIPI-CSI, 2D/3D Graphics, Video Encode/Decode (1080p30), Image Signal Processor, Display Controller (3 pipes)] 

Penwell SOC (Intel Hi-K 32nm Process Technology) 


Penwell CPU Dynamic Range 


Core Freq up to 1.6 GHz, for bursty workloads 

Core Freq = 1.3 GHz 
Power: ~500mW 

Core Freq = 600MHz 
Power: ~175mW 

Fine-grained power management through dynamic 
voltage & frequency scaling 

[Figure: CPU operating points (power labels include ~50mW, ~175mW, and ~500mW) with burst mode above the top point; power vs. time plots for a fixed workload and a bursty workload] 

Power assumptions: Tj=70C. Steady State Worst Case ST App Power. 
Projected on Intel 32nm process 


Core/L2$ Power Is ~Zero 
CPU State Saved in SRAM 
<100 uS Exit Latency 


Wide Dynamic Range & Fast Exit Latencies = Big Energy Savings 




Browser Results Summary 


              600 MHz   900 MHz       1,500 MHz 
Frequency     1x        1.5x          2.5x 
Performance   1x        [illegible]   [illegible] 
Power         1x        1.29x         1.81x 


"Race to Idle” at higher frequency uses more power, but is lower energy 


Mobile and Communications Group ( intel) 
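
The "Race to Idle" claim can be checked with a small energy model: for a fixed task, energy is power multiplied by run time, and run time shrinks as performance rises. The sketch below uses the power ratios from the table; the performance ratio passed in the example is a placeholder, since the measured values are not legible in this copy, and the model ignores the idle power drawn after the task finishes.

```python
# Relative power at each frequency, taken from the Browser Results table.
POWER_RATIO = {600: 1.00, 900: 1.29, 1500: 1.81}


def relative_energy(power_ratio: float, perf_ratio: float) -> float:
    """Energy for a fixed task relative to the 600 MHz baseline.

    E = P * t, and t scales as 1 / performance.
    """
    return power_ratio / perf_ratio


def break_even_speedup(power_ratio: float) -> float:
    """Minimum speedup at which the faster point uses no more energy."""
    return power_ratio


# HYPOTHETICAL speedup of 2.0x at 1,500 MHz (not a measured value).
print(relative_energy(POWER_RATIO[1500], perf_ratio=2.0))
print(break_even_speedup(POWER_RATIO[1500]))
```

Under this model, any speedup above 1.81x at 1,500 MHz finishes the task with less energy than the 600 MHz baseline, even though instantaneous power is higher.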


Power C-States 
[Table: C-states C0 HFM, C0 LFM, C1/C2, and C4 compared by core voltage, core clock, PLL, L1 caches, L2 cache, and wakeup time; the cell values were graphical] 


The OS Is Responsible For Identifying When The Processor Needs To Be In A 
Certain C State And Requests The Processor To Enter That State 




S0i1 
* Used during idle (e.g. home screen, web browsing) 
* Ultra Low Power: mW 
* Entry-Exit Latency: us 


S0i3 / S3 
* Used when NOT interacting with the device (e.g. standby mode) 
* SoC power: uW 
* Entry-Exit Latency: ms 


[Figures: Penwell SOC block diagrams for the low power states. Labels: Video Encode/Decode (1080p30), Display Controller (3 pipes), Image Signal Processor, Power Manager, 2D/3D Graphics, Low Power Audio, Security Engine] 

New OS Power Management (OSPM) 


* Pervasive Power Management 
  ✓ Integrated PMU 
  ✓ Dedicated Power Delivery IC 
  ✓ Active management through HW, FW, SW 
* Software-Directed 
  ✓ Operating system power management 
  ✓ Manages all hardware capabilities 
* Fine Grain Power Management 
  ✓ 13 rails for IO & logic voltages 
  ✓ 45 Power islands for sub-systems 
  ✓ Aggressive power and clock gating 
  ✓ Integrated clocks and [illegible] power down 


OSPM Directs Entire Platform to Lowest Power State 




Platform Power Management Architecture 


[Figure: software stack across User Level, Kernel Level, and Firmware. Labels: Android Power Manager, PMU Idle handler, PMU Interface, Memory Mapped Registers, Local Power Management, Dev freq, Idle Handler, PMU Firmware] 
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Medfield Reference Platform 




High Performance CPU 


1.6 GHz Intel® Atom™ Processor Z2460 


Full HD Video 


1080p, 30fps Video Encoding 
1080p, 30fps Video Playback 


Advanced Imaging 
Intel Image Signal Processing (ISP) 
Advanced UI/UX from Intel 


Great Graphics 
PowerVR SGX 540 @ 400 MHz 
High Speed Connectivity 
XMM 6260 21/5.8Mbps HSPA+ 


Apps 
Google* Play (Android* Apps) 


* other brands and names may be claimed as the property of others 
** Battery: 1500mAh, 3.7V 




Current Design Wins 
Lava XOLO X900 in India 

Lenovo K900 in China 

Orange San Diego in UK 

Orange with Intel Inside® in France. 


High Resolution 
Display 


Internal : 1024x600;1024x768p capable 


Optimized Ано Support 


Customizable User Experience 


Enhanced Power/Battery Life 
Standby: 14 days** 

Video (1080p): 6 hours 

Browsing 3G: 5 hours 

Voice Call: 8 hours 
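
The battery life figures can be turned into average platform power using the footnoted battery. The sketch below assumes the footnote's "1500mA" means a 1500 mAh capacity and that the full rated capacity is usable; both are assumptions, not statements from the slide.

```python
# ASSUMPTION: 1500 mAh at 3.7 V, fully usable.
BATTERY_WH = 1.5 * 3.7  # 5.55 Wh


def average_power_mw(runtime_hours: float) -> float:
    """Average platform power (mW) that drains the battery in runtime_hours."""
    return BATTERY_WH / runtime_hours * 1000.0


for label, hours in [("Standby", 14 * 24), ("Video (1080p)", 6),
                     ("Browsing 3G", 5), ("Voice Call", 8)]:
    print(f"{label}: {average_power_mw(hours):.1f} mW")
```

Fourteen days of standby works out to roughly 16.5 mW on average, which is in line with the 14mW standby power quoted in the platform progression table.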


Security 


Programmable Security Engine 
Remote Management Features 


Operating System 
Android 2.3.7 (Gingerbread) 
Android 4.0.4 (Ice Cream Sandwich) 




Smartphone Platform Roadmap 


[Roadmap figure, ordered from higher performance to lower cost] 


PERFORMANCE SMARTPHONE 
* Intel® Atom™ Z2580: up to 2X performance, HSPA+ / LTE 
* Intel® Atom™ Z2460: up to 1.6 GHz processor, Intel XMM™ 6260: HSPA+ 


VALUE SMARTPHONE 
* [illegible] GHz processor, Intel XMM 6265: HSPA+ 




Medfield Summary 


* Medfield meets tight Smartphone 
power consumption constraints and 
provides outstanding scalar CPU 
performance 


✓ “Race to Idle” minimizes energy 
consumption while providing excellent end-user 
experience 


✓ Ultra low power SOC states cater to 
common “user idle” and “system idle” 
scenarios 

✓ Accelerators for Video, Camera, Audio 
processing provide energy optimized 
media capabilities 




Pat Gelsinger 
President & COO 
Information Infrastructure Products 
EMC Corporation 
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Waves Of Change In IT 


Cloud Computing 


Networked/Distributed Computing 


PC/Microprocessor 
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Phases Of IT Maturity 


Reactive                        | Proactive              | Innovative 
Respond To Business Request     | Increase IT Agility    | Differentiate The Business 
Packaged Applications           | Service Catalog        | New Business Applications 
Flat IT Tax, Project-centric    | Cost & Use Metrics     | Pay-For-Use 
Dedicated Vertical Stacks       | Dynamic Resource Pools | Automated Infrastructure 
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Steps Of IT Transformation 


Cloud Applications 
Build Mobile & Predictive App 


Cloud Operating Model 


Standardize & Automate To Run IT-As-A-Service 


Cloud Infrastructure 
Build A Software-Defined Data Center 


Reactive Proactive Innovative 
Respond To Business Request Increase IT Agility Differentiate The Business 
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intel 


Xeon® 5600 
Flash Fills The Performance Gap 


[Chart: Ops / Second (100's to 1,000,000,000's) versus Latency (Picoseconds, Nanoseconds, Microseconds, Milliseconds, Seconds)] 


© Copyright 2012 EMC Corporation. All rights reserved. 


Dramatic Performance Growth For x86 


2000% Performance Increase Since 2005 


[Chart: relative performance by year, 2005 to 2011] 


Source: Intel internal OLTP database workload performance estimates as of 15 April 2011. Results have been estimated 
based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or 
software design or configuration may affect actual performance. 
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Dominant Market Share For x86 


x86 As A Percent Of Worldwide Server Shipments 


[Chart: x86 unit share and x86 revenue share, 0% to 100%, 1989 to 2010] 


Source: IDC 
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Waves of Change 


MOBILE | CLOUD | BIG DATA 


Net Revenue From CPUs & Chipsets+ 


Massive Transition 
in a Decade 


[Chart: revenue (in $1 million USD), 0 to 35,000, for 2001, 2005, 2011, and 2015; categories: Other Mobile, x86 Mobile, x86 Desktop, x86 Servers] 


+: Data from Intel 2011, 2005, 2001 Annual Reports, http://www.isuppli.com, http://www2.uta.edu/marketing 
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More than Power, Performance, Cost and Footprint 


Embed HW in SoC for: 
Virtualization 


Graphics: Remote desktop / 
graphics-rich remote UI 


Security: e-Currency, DRM, 
Anti-Virus 


Encryption: Fast, secure end- 
point communication 


Standard HW interface 
for generic OS / SW 
management 
More than Power, Performance, Cost and Footprint 


Design for the Software- 
Defined Data Center & 
Big Data 


Server NICs integrate VXLAN 
VTEP 


HW accelerates remote 
graphics-rich desktops and 
connection protocols 


Programmable HW to classify 
and inspect network packets 


Large on-chip, high-speed 
memory (SRAM, PCM, Flash) 
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2009: More Apps On Virtual Infrastructure 
[Chart: Virtual Machines versus Physical Hosts, 0 to 17,500,000, for 2005 to 2013, annotated "The Tipping Point"] 


Source: IDC 
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2008 2012 FUTURE 


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SOFTWARE-DEFINED 
DATACENTER 


All infrastructure is virtualized and 
delivered as a service, and the control of 
this datacenter is entirely automated by 
software. 
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Traditional View of the DC Environment 
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MGMT  STORAGE/AVAILABILITY 


AUTOMATE 


Mission Critical 
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Old World: Static Security 


Static Attacks 


Generic, Systems-Based 


Static Infrastructure 
Physical, IT Controlled 


Static (Bolt-On) Defenses 


Signature-Based, At Perimeter 
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New World: Dynamic Security 


Dynamic Attacks 
Cloud Targeted, Human-Based 


Dynamic Infrastructure 
Virtual, User-Centric 


Dynamic (Built-In) Defenses 
Analytics & Risk-Based 
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BIG DATA 
TRANSFORMS 
BUSINESS 


IN 2000 THE WORLD GENERATED 
TWO EXABYTES 
OF NEW INFORMATION 


IN [year illegible] THE WORLD GENERATED 
TWO EXABYTES 
OF NEW INFORMATION 
EVERY DAY 


How Companies Are Using Big Data 


Functional Areas Where Companies Are Using Big Data 


Customer Intimacy 
Budgeting & Planning 
Operations & Supply Chain 
Customer Service 
Performance Management 
New Product Strategy 


Pricing 


[Chart: horizontal bars on a 0 to 50 scale] 


McKinsey Global Survey of 1,469 C-level executive respondents at a range of industries and company sizes, “Minding Your Digital Business,” 2012. 
Big data: datasets so large 
they break traditional IT 
infrastructures. 


Data Science applies advanced 
analytical tools and algorithms 
to generate predictive insights 
and new product innovations 
that are a direct result of the data 


BI focuses on managing 
and reporting on existing 
data to monitor and 
manage concerns within 
the enterprise 


Who Is The Data Scientist? 


Source: EMC Study, “Data Science Revealed: A Data-Driven Glimpse into the Burgeoning New Field,” December 5, 2011 
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Training Tomorrow’s Talent 
EMC Academic Alliance 


practice of analytics, including a Data 
Analytics Lifecycle to address business 
challenges 


EMC Data Science 
Associate Certification 
(EMCDSA) 


PROFESSIONAL 
* Silicon design remains essential - HW/SW co-design is critical 


* The action is in the edges (Mobile & Server) 


* Cloud becomes the Software-Defined Datacenter 


* Big Data opens up new opportunities for HW design 


POWER7+™ 


IBM 


| Systems & Technology Group 


Outline 


> POWER Processor History 
> Design Overview 
> Performance Benchmarks 


> Key Features 
= Scale-up / Scale-out 
= The new accelerators 
= Advanced energy management 


> Summary 


Statements regarding Power7+™ features do not imply that IBM will introduce a system with this capability 


August 29th, 2012 




20+ Years of POWER Processors 


[Figure: timeline of POWER processors. Recoverable labels: RSC, POWER2™ (P2SC), Cobra A10, Muskie A35, RS64I Apache, RS64II North Star, RS64III Pulsar, RS64IV Sstar, POWER3™ (630), POWER5™ (SMT), POWER6™ (Ultra High Frequency), POWER7™ (Multi-core), POWER7+™ (accelerators)] 
-Ultra High Frequency 


Major POWER® Innovation 

-2001 Dual Core Processors 
-2001 Large System Scaling 
-2001 Shared Caches 
-2003 On Chip Memory Control 
-2003 SMT 
-2006 Ultra High Frequency 
-2006 Dual Scope Coherence Mgmt 
-2006 Decimal Float/VMX 
-2006 Processor Recovery/Sparing 
-2010 Balanced Multi-core Processor 
-2010 On Chip EDRAM 
-2012 On chip Accelerators 
-2012 Massive L3 cache 
-2012 Power Gating 


* Dates represent approximate processor power-on dates, not system availability 
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> Area: 567mm2 


> 8 processor cores 
= 12 execution units per core 
= 4 Way SMT per core 
= 32 Threads per chip 
= 256KB L2 per core 


> Scalability up to 32 Sockets 
= 360GB/s SMP bandwidth/chip 
= 20,000 coherent operations in flight 


> Technology: 32nm lithography, Cu, SOI, 
eDRAM, 13 metal levels 
> 2.1B transistors 
Equivalent function of 5.4B 
> 80MB on chip eDRAM shared L3 
> Accelerators 
> Enhanced Power management 
> Binary Compatibility with POWER6/7 


Up to 25% frequency gain due to 
mapping into 32nm technology and 
power management improvements. 


Increase of L3 memory capacity by 2.5x 


Doubled single precision floating-point 
performance 


Added Power Gating regions for Core/L2 
& L3 regions 


Core/L2 Power-Gating 
L3 Power-Gating 
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Optimized in Two Dimensions 


[Figure: POWER7 versus POWER7+ in two dimensions. Labels: Thread Strength, SMP Scaling, Throughput, Bandwidth, Amplifier, Single Chip Module, Dual Chip Module] 
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Performance View 


Normalized POWER7 vs POWER7+ Comparison 
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POWER7+ Accelerators 


> Provide CPU off-load and workload speedup for SSL, encrypted file system, 
and active memory expansion (AME). 

= Asymmetric Math Functions (AMF) 
* RSA cryptography 
* ECC (elliptic curve cryptography) 

= Advanced Encryption Standard (AES)/Secure Hash Algorithm (SHA) 
* Symmetric-key cryptography with combinational modes 

= Random Number Generator (RNG): True hardware entropy generator 
* Cannot be algorithmically reverse engineered 

= 842 proprietary compression algorithm 
* High bandwidth, area efficient 

> Integrated across silicon, ISA, hypervisor, and OS 
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Accelerators Details 


> Advanced Encryption Standard engine 


= Modes: ECB, CBC, CTR, CCM, CCA, GCM, GCA, GMAC, 
CM, F8, XCBC-MAC-96 
= Key lengths: 128b, 192b, 256b 
= Three engines 


> Secure Hash Algorithm engine 


= Modes: SHA-1, SHA-256, SHA-512, MD5 
= HMAC supported for SHA 
= Three engines 


> Asymmetric Math Functions 


= Modular math functions for RSA (Rivest, Shamir, Adleman) 
and ECC (elliptic curve cryptography): mod add, mod 
subtract, mod inverse, mod reduction, mod multiplication, mod 
exponentiation, mod exponentiation CRT (integer only) 
= Point functions for ECC GF(p) and GF(2^m): point add, point 
double, point multiply 
= RSA lengths: 512b, 1024b, 2048b, 4096b 
= ECC GF(p) lengths: 192b, 224b, 256b, 384b, 521b (SuiteB) 
= ECC GF(2^m) lengths: 163b, 233b, 283b, 409b, 571b (SuiteB) 
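
The "mod exponentiation CRT" function in the list above refers to the standard Chinese Remainder Theorem shortcut for RSA private-key operations. The sketch below shows the math in software with a textbook toy key; it illustrates the operation the engine offloads and is not IBM's interface, and the function name is hypothetical.

```python
def modexp_crt(c: int, p: int, q: int, d: int) -> int:
    """Compute c**d mod (p*q) using two half-size exponentiations."""
    dp, dq = d % (p - 1), d % (q - 1)   # reduced private exponents
    q_inv = pow(q, -1, p)               # q^-1 mod p (Python 3.8+)
    m1 = pow(c, dp, p)                  # result modulo p
    m2 = pow(c, dq, q)                  # result modulo q
    h = (q_inv * (m1 - m2)) % p         # Garner recombination step
    return m2 + h * q


# Toy key for illustration only; real keys use the 512b to 4096b
# lengths listed above.
p, q, e, d = 61, 53, 17, 2753
message = 65
ciphertext = pow(message, e, p * q)
print(modexp_crt(ciphertext, p, q, d))
```

The CRT form works on operands half the width of the modulus, which is why hardware and software RSA implementations commonly prefer it for private-key operations.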


> Random Number Generator 


= All digital design which produces 64b random numbers 
accessible by MMIO load instructions 
= Correctness verified against the NIST Random Number 
Generator Test Suite 


> Active Memory Expansion 


= IBM-proprietary algorithm with 8B-, 4B-, and 2B-phrase 
parsings 
= Throughput: Up to 8 bytes of compression or 8 bytes of 
decompression per bus cycle. 


> MCD 


= Hardware to predict whether memory access is on-node or 
off-node. 


[Figure: accelerator unit block diagram. Labels: SMP Interconnect, Common Queue, MCD, Interrupt Controller, RNG, DMA Engine (8 channels), ingress cache, I/O Buffers] 
POWER7+ Sleep & Winkle Overview 


Nap (per core) 
  Stop clocks to only processor core execution engines. 
  Leave all caches running. 
  Saves ~10% power with ~[illegible]us Latency 

Sleep (per core) 
  Power OFF the core plus private L2 cache. 
  Requires restore/re-init to wakeup. 
  Leave shared L3 cache running. 
  Saves ~80% power with ~3ms Latency 

Winkle (per chiplet) 
  Power OFF the entire chiplet. 
  Requires restore/re-init to wakeup. 
  Takes offline 1/8 of the shared L3 cache. 
  Saves > 95% power with < 6ms Latency 

[Figure: core chiplet diagrams showing Core + L2 and L3 cache in each state] 
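
The trade-off between savings and wake-up latency suggests a simple selection rule: pick the deepest state whose wake-up latency the workload can tolerate. The sketch below is a hypothetical policy, not IBM's firmware or hypervisor logic. Savings and the Sleep and Winkle latencies come from the overview above; the Nap latency is not legible in this copy, so the value used is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdleState:
    name: str
    power_saving: float    # fraction of power saved
    wake_latency_s: float  # worst-case wake-up latency in seconds


STATES = [
    IdleState("nap", 0.10, 5e-6),     # ASSUMPTION: microsecond-scale latency
    IdleState("sleep", 0.80, 3e-3),   # ~3 ms per the slide
    IdleState("winkle", 0.95, 6e-3),  # >95% and <6 ms per the slide
]


def pick_idle_state(max_wake_latency_s: float) -> Optional[IdleState]:
    """Return the state with the largest saving that meets the latency bound."""
    candidates = [s for s in STATES if s.wake_latency_s <= max_wake_latency_s]
    if not candidates:
        return None  # stay in the idle loop
    return max(candidates, key=lambda s: s.power_saving)


print(pick_idle_state(4e-3).name)
```

A real policy would also weigh how long the core is expected to stay idle, since entering and leaving Sleep or Winkle costs energy for the purge and restore steps.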


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Save Energy When Idle 


Three idle states were implemented to optimize power vs. latency 


Nap (Continued POWER7 support) 
  Optimized for wake-up time 
  Turn off clocks to execution units only 
  Caches remain coherent 

Sleep (Improved from POWER7) 
  More savings at increased latency 
  Purge and power off core plus L2 caches 
  Leave shared L3 cache running 

Winkle (New for POWER7+) 
  Maximum savings at higher latency 
  Purge and power off entire chiplet 
  Takes eighth of chip L3 cache offline 


[Chart: Processor Idle States. Wake-up latency versus Processor Energy Reduction (compared to Idle Loop), with points at 10%, 35%, 80%, and >95%; labeled points include Sleep (Power7), Sleep, and Winkle] 
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Real Time Chip Guardband 


> Conventional guardband 
* Static, conservative voltage margins 
for potential worst-case conditions 
* Causes unnecessary loss of energy 
efficiency during typical server usage 


> Critical Path Monitor (CPM) 
* Real Time detection of available 
circuit timing margin 


[Figure: CPM circuit. Insertion delay (calibration), representative critical path, edge detector (tunable). Location of edge indicates margin] 


Components of Chip Guardband 


[Chart: static guardband components: uncertainty (margin), test inaccuracy, reliability (wearout), voltage variation, thermal variation] 
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Real Time Guardband — DPLL/CPM feedback loop 


[Figure: per-core clock distribution. Labels: Reference Clock, Buffering, Sensor, Aggregator; CPM Fmax Delay = f(T, V)] 


CPM data encoding 
“11111” = large margin 
“11110” = some margin 
“11100” = ideal margin 
“11000” = margin too small 
“10000” = not enough margin 
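
The CPM encoding is a five-bit thermometer code, so the number of ones measures the available timing margin. The sketch below decodes it and applies a hypothetical proportional frequency adjustment to illustrate the idea of a DPLL/CPM feedback loop. The step size, the controller policy, and the function names are assumptions; the slides give only the encoding.

```python
MARGIN_LEVELS = {
    "11111": "large margin",
    "11110": "some margin",
    "11100": "ideal margin",
    "11000": "margin too small",
    "10000": "not enough margin",
}

IDEAL_STEPS = 3  # "11100" is the ideal margin


def margin_steps(code: str) -> int:
    """Number of ones in a valid 5-bit CPM thermometer code."""
    if code not in MARGIN_LEVELS:
        raise ValueError(f"unexpected CPM code: {code!r}")
    return code.count("1")


def next_frequency_mhz(freq_mhz: float, code: str,
                       step_mhz: float = 25.0) -> float:
    """HYPOTHETICAL controller: speed up with spare margin, slow down without."""
    error = margin_steps(code) - IDEAL_STEPS
    return freq_mhz + error * step_mhz


print(next_frequency_mhz(4000.0, "11111"))
print(next_frequency_mhz(4000.0, "10000"))
```

The same error signal could equally drive voltage down at a fixed frequency, which is how reclaimed margin turns into energy savings.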
CPMs are strategically placed in known hot spots typically 
near micro-architecture critical paths. 


The real time feedback from CPMs can reduce how much 
margin is needed for various guardband components. 


Real-time guardbanding will allow for greater energy efficiency. 


Components of Chip Guardband 


[Chart: guardband without CPM versus with CPM, showing reclaimed margin and unchanged components. Components: test inaccuracy, uncertainty (margin), reliability (wearout), voltage variation, thermal variation, CPM inaccuracy. Floorplan legend: CPM (Critical Path Monitor), AND buffer] 
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Advanced 32nm Technology 
> 32nm High-K Metal Gate (HKMG) SOI based logic technology 
= 3 logic transistor threshold voltages (Vt) optimize power/performance 
= 13-layer BEOL metal stack minimizes cross die latency 
* 1x, 2x, 4x, 8x, & ultra-thick metal layers 
= eDRAM provides 3-4x density advantage over SRAM 
> Advanced features make the node effectively perform as one with sub-32nm features 


[Figures: EDRAM cell; EDRAM Density Gain, comparing die size for 32nm w/ eDRAM and 32nm w/ SRAM only] 


> Brings significant improvement to both scale up & scale out systems. 
> The new accelerators optimize specific functions while offloading CPU. 


> Advanced energy management greatly improves data center efficiency. 
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Legal Disclaimer - Notice 


* INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO 
ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH 
PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL 
PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, 
COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. 


UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF 
THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. 


Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any 
features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or 
incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 


The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published 
specifications. Current characterized errata are available on request. 


Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. 


Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go 
to: http://www.intel.com/design/literature.htm 


* Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor 
families. Go to: http://www.intel.com/products/processor_number 


* Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. AES-NI 
is available on select Intel® processors. For availability, consult your reseller or system manufacturer. For more information, see http://software.intel.com/en- 
us/articles/intel-advanced-encryption-standard-instructions-aes-ni/ 


* No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer with Intel® 
Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launched environment 
(MLE). Intel TXT also requires the system to contain a TPM v1.s. For more information, visit http://www.intel.com/technology/securit 


* Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance 
or other benefits will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your 
PC manufacturer. For more information, visit http://www.intel.com/go/virtualization 


* Requires a system with Intel® Turbo Boost Technology. Intel Turbo Boost Technology and Intel Turbo Boost Technology 2.0 are only available on select Intel® 
processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For more information, visit 
http://www.intel.com/go/turbo 


* Intel product is manufactured on a lead-free process. Lead is below 1000 PPM per EU RoHS directive (2002/95/EC, Annex A). No exemptions required 
* Halogen-free: Applies only to halogenated flame retardants and PVC in components. Halogens are below 900ppm bromine and 900ppm chlorine. 


* Copyright © 2012 Intel Corporation. All rights reserved. Intel, Intel Xeon, the Intel Xeon logo and the Intel logo are trademarks of Intel Corporation in the U.S. and/or 
other countries. 
*Other names and brands may be claimed as the property of others. 
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Legal Disclaimers - Performance 


* Software and workloads used in performance tests may have been optimized for performance only on Intel 
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, 
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You 
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, 
including the performance of that product when combined with other products. 


* Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this 
document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance 
benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of 
systems available for purchase. 


* Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the 
actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other 
platforms, and assigning them a relative performance number that correlates with the performance improvements 
reported. 


* SPEC, SPECint, SPECfp, SPECrate, SPECpower_ssj, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, 
and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more 
information. 


* TPC Benchmark is a trademark of the Transaction Processing Council. See http://www.tpc.org for more information. 


* SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See 
http://www.sap.com/benchmark for more information. 
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Optimization Notice 




Intel® compilers, associated libraries and associated development tools may include or utilize 
options that optimize for instruction sets that are available in both Intel® and non-Intel 
microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel 
microprocessors. In addition, certain compiler options for Intel compilers, including some that 
are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a 
detailed description of Intel compiler options, including the instruction sets and specific 
microprocessors they implicate, please refer to the “Intel® Compiler User and Reference Guides” 
under “Compiler Options.” Many library routines that are part of Intel® compiler products are 
more highly optimized for Intel microprocessors than for other microprocessors. While the 
compilers and libraries in Intel® compiler products offer optimizations for both Intel and Intel- 
compatible microprocessors, depending on the options you select, your code and other factors, 
you likely will get extra performance on Intel microprocessors. 


Intel® compilers, associated libraries and associated development tools may or may not optimize 
to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel 
microprocessors. These optimizations include Intel® Streaming SIMD Extensions 2 (Intel® 
SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and Supplemental Streaming SIMD 


Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not guarantee 
the availability, functionality, or effectiveness of any optimization on microprocessors not 
manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for 
use with Intel microprocessors. 


While Intel believes our compilers and libraries are excellent choices to assist in obtaining the 
best performance on Intel® and non-Intel microprocessors, Intel recommends that you evaluate 
other compilers and libraries to determine which best meet your requirements. We hope to win 
your business by striving to offer the best performance of any compiler or library; please let us 
know if you find we do not. 

Notice revision #20101101 


Intel® Xeon® Processor E5-2600 Product Family Architecture, Power Efficiency, and Performance 


Foundations of SNB-EP Performance 




Peak Ring BW Math 
bytes data bus 


directions 


active stops 
= GBy/s 


Intel® Xeon® Processor E5-2600 Product Family Architecture, Power Efficiency, and Performance 


Foundations of SNB-EP Performance 
Add an LLC, System Agents, and Power Management 


Integrated PCIe 
I/O __ GBy/sec 


__ GBy/sec 
— | 


Unit 
Peak Rates 
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Thurley Platform Review 
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Foundations of SNB-EP Performance 


[Figure: Thurley and Romley platform topologies. In the Romley diagram each SNB-EP socket is labeled as a combined CPU-I/O Caching Agent] 


Topology Performance Changes 

* 40 Lanes of 8 GT/s Integrated PCIe 
* Dual Inter-processor QPI links 
* Four higher speed memory channels 
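
As a rough guide to what 40 lanes of 8 GT/s provide, the raw link rate can be converted to bytes per second. The sketch below assumes the 128b/130b line encoding that PCIe 3.0 uses at 8 GT/s (a property of the PCIe specification, not stated on the slide) and ignores packet header and flow control overhead, so delivered bandwidth is lower. The function name is illustrative.

```python
def pcie_gen3_gbytes_per_s(lanes: int, gt_per_s: float = 8.0) -> float:
    """Raw one-direction PCIe bandwidth in GB/s after 128b/130b encoding."""
    payload_bits_per_s = gt_per_s * 1e9 * 128 / 130  # per lane
    return lanes * payload_bits_per_s / 8 / 1e9


print(f"x16: {pcie_gen3_gbytes_per_s(16):.2f} GB/s per direction")
print(f"x40: {pcie_gen3_gbytes_per_s(40):.2f} GB/s per direction")
```

With 40 lanes per socket this comes to a little over 39 GB/s in each direction before protocol overhead.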
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Foundations of SNB-EP Performance 
Focus on I/O Performance 


o PCIe G3: 8 GT/s vs. 4 GT/s 
o DMI2 (4 GT/s) vs. DMI1 (2 GT/s) (not shown in diagram) 
o I/O capacity scales with sockets (memory BW) 
* Inherent benefit from Integration: 
QPI link to I/O controller replaced with direct ring 
interconnect reducing latency and increasing BW 
* CPUs and PCIe are a unified Caching Agent 
o Less resource partitioning 
> More scalable, higher performance 
o Reduces the latency of cacheable traffic 
o PCIe acts under the auspices of and uses the LLC (more later) 
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Foundations of SNB-EP Performance 
Focus on I/O Performance (cont'd) 


* I/O-related Optimizations 
o Double width data buses in the I/O unit 
o ReadCurrent semantics rather than Code Read 
> Potentially reduces memory write traffic - maybe a lot 
o Inbound writes 
> Cache line pre-allocated but ownership can be preempted 
> Prefetch of data (for write merging) 
o 40 lanes vs. 36 lanes 
* Physical address range (46b vs. 41b) 
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Foundations of SNB-EP Performance 
Focus on I/O Performance (cont'd) 


* Intel® Direct Data I/O Technology (Intel® DDIO): 
IIO allocates and transfers directly into LLC 
o IIO cache allocating is generally limited to 2 (of 20) ways 
> Can use a line that’s already been allocated by, say, a core 
o Circular buffers of reasonable size (a few to ten MBy) can 
reside in the LLC and, in practice, almost never be written. 
o Making use of this can effectively double the achievable I/O 
bandwidth of a core and of a socket. 
o Permits practically linear scaling as multiple high bandwidth 
I/O devices are added (e.g., 10 GbE adapters) while 
achieving nearly zero read and write bandwidth to memory 


> Saves power, too 
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Mid-Game Summary 


* Improved performance by improving the parts 
o Sandy Bridge core 
o On-die interconnect (“Ring”) 
o More and faster memory channels with improved scheduling 
o Faster inter-socket communication (Intel® QPI) 
o Integrating and accelerating I/O 


aComing Up in the Next Half: 


Performance with Power Efficiency 


Performance: Energy Efficient Load Line

[Figure: Server Platform Power versus Workload]
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e Platform efficiency at low Power . Processor Power 
• CPU and DRAM VR Phase shedding 


* Scalable Uncore Power
• Uncore voltage frequency scaling
• I/O Power management: QPI L0p/L1, PCIe ASPM L1
* Scalable Memory Power
• Multi-rank slow CKE


Significant Improvement to Proportional Energy


Dynamic Performance Load Line


* PCU dynamically adjusts to OS Power Management Policy
  - OS communicates Policy through EPB (Energy Perf BIAS)
* PCU monitors and adjusts autonomous on-die power saving engines
* PCU automatically adjusts for Performance at high utilization
  - Leverages EPB to switch into performance mode when necessary
* Optimized across a range of workloads
  - Single-threaded workloads
  - Multi-threaded workloads

[Figure: System Power vs. Throughput for the Balanced and Performance policies, showing dynamic switching between them]


PCU works synergistically with OS Power Policy
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Running Average Power Limiting (RAPL) 


[Figure, left: Power Limiting Quality Based on P state Control. Wall Power and Power Limit plotted against Time (Seconds)]
[Figure, right: Power Limiting Quality Based on RAPL. PSU readings, CPU readings and Power limit plotted against Time (Seconds), with the start of limiting marked]


RAPL gives more accurate and stable power limiting than P-state control
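
To make the "running average" idea concrete, here is a minimal sketch of how software could turn cumulative energy readings into average power and compare a windowed average against a limit. On Linux, the powercap interface typically exposes such counters as `energy_uj` and `max_energy_range_uj` under `/sys/class/powercap/intel-rapl:*`; the functions below take plain numbers instead of reading those files, and their names and the window policy are assumptions for illustration, not the PCU's actual algorithm.

```python
def average_power_watts(e_start_uj, e_end_uj, interval_s, counter_max_uj):
    """Average power over an interval from two cumulative energy readings.

    Readings are in microjoules. The counter wraps at counter_max_uj, so a
    smaller end value is treated as exactly one wraparound.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = e_end_uj - e_start_uj
    if delta < 0:
        delta += counter_max_uj
    return delta / 1e6 / interval_s


def over_limit(samples_w, limit_w, window):
    """True if the mean of the last `window` power samples exceeds limit_w."""
    recent = samples_w[-window:]
    if not recent:
        return False
    return sum(recent) / len(recent) > limit_w


# 130 J consumed in one second
print(average_power_watts(1_000_000, 131_000_000, 1.0, 262_143_328_850))
# Short spike above the limit, but the 3-sample average stays below it
print(over_limit([120.0, 140.0, 150.0], 140.0, window=3))
```

Averaging over a window is what lets a limiter tolerate short excursions above the limit while still holding the longer-term average.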


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those 
factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. 
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Improved Efficiency with RAPL 


[Figure: servers per rack under three scenarios: No Power Limiting, Power limiting w/o RAPL, and Power limiting with RAPL. Legible chart labels: 420W, Guard Band, and 12 and 14 servers per rack]


Max System Power: 600W 
Typical System Power: 350W 
Rack Power: 5kW 
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Socket RAPL & the Power/Performance Load Line 
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250 -- 


Memory Latency Optimizations
* Core and Uncore Frequency Change
* Dynamic Memory CKE Disable
* New LLC Prefetcher
* Distributed L3

Theoretical Peak: ~844GB/s (1s @ 3.3GHz w/ 8 cores)
Core->L3 Read Throughput: >250GB/s (1s @ 3.3GHz w/ 8 cores)
* Dual Load Ports on L1 D-Cache
* SandyBridge Turbo 2.0

>2x max bandwidth from Xeon 5600 on read BW
* 3->4 channels (+33%)
* 1333->1600 (+20%)
* Improved Efficiency (+~40%)
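
The bandwidth figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes 8 bytes per DDR3 transfer per channel and 32 bytes per cycle per core for the L1 (two 16-byte loads, consistent with the dual load ports mentioned); both are assumptions not stated on the slide, and the variable names are illustrative.

```python
def ddr_peak_gbytes(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Peak DRAM bandwidth per socket in GB/s."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


old_peak = ddr_peak_gbytes(3, 1333)      # Xeon 5600 class: 3 channels of DDR3-1333
new_peak = ddr_peak_gbytes(4, 1600)      # E5-2600: 4 channels of DDR3-1600
peak_gain = new_peak / old_peak          # channels (+33%) times speed (+20%)
with_efficiency = peak_gain * 1.40       # times the quoted ~40% efficiency gain

# Aggregate L1 load bandwidth: cores * GHz * bytes per cycle
l1_peak = 8 * 3.3 * 32

print(f"DDR peak: {old_peak:.1f} -> {new_peak:.1f} GB/s ({peak_gain:.2f}x)")
print(f"with efficiency gain: {with_efficiency:.2f}x")
print(f"L1 theoretical peak: {l1_peak:.1f} GB/s")
```

Under these assumptions the three factors multiply to a little over 2x, in line with the ">2x" claim, and the L1 figure lands on the quoted ~844 GB/s.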


Benchmark Notes: 
* Intel internal tool for BW and Latency 


Intel® Xeon® Processor E5-2600 Product Family Architecture, Power Efficiency, and Performance


Comparison of 4 core to 8 core Scaling @ 3.3GHz 


[Figure: two bar charts, "Integer Throughput Workloads" and "Floating Point Throughput Workloads", with a y-axis from 100% to 200%]


Core sensitive apps in both INT and FP show excellent performance scaling


Memory sensitive apps show less scaling (as expected; shown in red)


Internal Testing - Estimate
4c: SNB E5-2643 w/out Turbo (1 DPC, DDR 1600)
8c: SNB E5-2690 w/ Turbo (2 DPC, DDR 1600)
ICC 12.1 / RHEL 6.1 / 2.6.32.131
Apps highlighted in red are memory bandwidth sensitive


Intel® Xeon® E5 uncore provides significant core scaling




Configuration Details for Foil #25 


For the SPEC benchmarks, please see http://www.spec.org for more information 

Configuration Details: As of 31 May 2012 

SAP* SD 2-tier 

2x Intel Xeon processor X5690 (12M cache, 3.46GHz, 6.40GT/s Intel QPI) score 5220 SD users. Certification #2011005. Source:
http://download.sap.com/download.epd?context=40E2D9D5E00EEF7C4B299992CE278ECED5166ED278FF20DF78759DC5B1E5FE79;

2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 8.0GT/s Intel QPI) score 7865 SD users. Source:
http://download.sap.com/download.epd?context=40E2D9D5E00EEF7C5DDB3927818D671E00ECF023B5CE29EE68B565E9F19F1254

SPECvirt_sc*2010

2x Intel® Xeon® processor X5690 (6C, 12M, 3.06GHz) score 1367 @ 84 VMs. Source: http://www.spec.org/virt_sc2010/results/res2011q1/virt_sc2010-20110209-00022-perf.html;

2x Intel® Xeon® processor E5-2690 (8C, 2.9GHz, C0) score 2,388 @ 150 VMs. Source: http://www.spec.org/virt_sc2010/results/res2012q2/virt_sc2010-20120403-00045-perf.html

SPECpower_ssj*2008

Metrics for SPECpower are efficiency based and expressed as ssj_ops/watt.

2x Intel Xeon processor X5675 (12M cache, 3.45GHz, 6.40GT/s Intel QPI) score 3,329. Source: http://www.spec.org/power_ssj2008/results/res2011q4/power_ssj2008-20110713-00386.html;
2x Intel Xeon processor E5-2660 (20M cache, 2.2GHz, 8.0GT/s Intel QPI, C1) score 5,088. Source: http://www.spec.org/power_ssj2008/results/res2012q2/power_ssj2008-20120427-00454.html
TPC-E*
2x Intel Xeon processor X5690 (12M Cache, 3.46GHz, 2P/12C/24T) referenced as published at 1,284.14 tpsE, $250 USD/tpsE, available 5/4/11. Source: http://www.tpc.org/tpce/results/tpce_result_detail.asp?id=111050403;

Intel: 2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 2P/16C/32T) referenced as published at 1,863.23 tpsE, $207.85 USD/tpsE, available 3/6/12. Source: http://www.tpc.org/tpce/results/tpce_result_detail.asp?id=112030601
VMmark* 2
2x Intel Xeon processor X5690 (12M cache, 3.46GHz, 6.40GT/s Intel QPI) score 7.59 @ 7 Tiles. Source: http://www.vmware.com/a/assets/vmmark/pdf/2011-10-18-Fujitsu-RX300S6.pdf;

2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 8.0GT/s Intel QPI, C1) score 11.13 @ 10 Tiles. Source: http://www.vmware.com/a/assets/vmmark/pdf/2012-05-15-HP-DL360pG8.pdf
TPC-C*
2x Intel Xeon processor X5690 (12M Cache, 3.46GHz, 2P/12C/24T) referenced as published at 1,053,100 tpmC, $0.57 USD/tpmC, available 6/20/11. Source: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=111120802;

2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 8.0GT/s Intel QPI) referenced as published at 1,503,544 tpmC, $0.53 USD/tpmC, available 4/11/12. Source: http://www.tpc.org/tpcc/results/tpcc_result_detail.asp?id=112041101
SPECjbb*2005
2x Intel Xeon processor X5690 (12M cache, 3.46GHz, 6.40GT/s Intel QPI) score 975,257 bops, 487,629 bops/JVM. Source: http://www.spec.org/osg/jbb2005/results/res2011q1/jbb2005-20110215-00950.html;
2x Intel Xeon processor E5-2690 (2.9GHz, 8C) score 1,584,567 bops. Source: http://www.spec.org/osg/jbb2005/results/res2012q1/jbb2005-20120306-01056.html

SPECint*_rate_base2006
2x Intel Xeon processor X5690 (12M cache, 3.45GHz, 6.40GT/s Intel QPI) baseline score 425. Source: http://www.spec.org/cpu2006/results/res2012q2/cpu2006-20120322-20154.html
2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 8.0GT/s Intel QPI) baseline score 671. Source: http://www.spec.org/cpu2006/results/res2012q1/cpu2006-20120307-19618.html
SPECjEnterprise*2010
2x Intel Xeon processor X5690 (12M cache, 3.46GHz, 6.40GT/s Intel QPI) score 5,427 EjOPS. Source: http://www.spec.org/jEnterprise2010/results/jEnterprise2010.html;
2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 8.0GT/s Intel QPI) score 8,310.19 EjOPS. Source: http://www.spec.org/jEnterprise2010/results/jEnterprise2010.html
SPECfp*_rate_base2006
2x Intel Xeon processor X5690 (12M cache, 3.45GHz, 6.40GT/s Intel QPI) baseline score 271. Source: http://www.spec.org/cpu2006/results/res2012q1/cpu2006-20111219-19195.html
2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 8.0GT/s Intel QPI) baseline score 496. Source: http://www.spec.org/cpu2006/results/res2012q1/cpu2006-20120307-19617.html
STREAM* MP Triad (NTW)

2x Intel Xeon processor X5690 (12M cache, 3.45GHz, 6.40GT/s Intel QPI) TRIAD score 42GB/s. Source: Intel TR#1241
2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 8.0GT/s Intel QPI, C1) score 79.5 GB/s. Source: Intel TR#1241
Linpack

2x Intel Xeon X5690 (12M cache, 3.45GHz, 6.40GT/s Intel QPI) score 159.4. Source: Intel TR#1236

2x Intel Xeon processor E5-2690 (20M cache, 2.9GHz, 8.0GT/s Intel QPI, C1) score 347.7. Source: Intel TR#1236
SPEC, SPECpower_ssj, SPECjEnterprise, SPECint, SPECjbb, SPECvirt sc, and SPECfp are trademarks of SPEC 
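
The generation-over-generation gains implied by the scores listed above can be tabulated directly. The sketch below only divides the newer score by the older one for each benchmark; the dictionary layout and labels are illustrative, and the SPECpower pair compares an X5675 with an E5-2660 as listed.

```python
# benchmark: (previous-generation score, E5-2600 family score), as listed above
scores = {
    "SAP SD 2-tier (users)": (5220, 7865),
    "SPECvirt_sc (score)": (1367, 2388),
    "SPECpower_ssj2008 (ssj_ops/watt)": (3329, 5088),
    "TPC-E (tpsE)": (1284.14, 1863.23),
    "VMmark 2": (7.59, 11.13),
    "TPC-C (tpmC)": (1053100, 1503544),
    "SPECjbb2005 (bops)": (975257, 1584567),
    "SPECint_rate_base2006": (425, 671),
    "SPECjEnterprise2010 (EjOPS)": (5427, 8310.19),
    "SPECfp_rate_base2006": (271, 496),
    "STREAM Triad (GB/s)": (42, 79.5),
    "Linpack": (159.4, 347.7),
}

ratios = {name: new / old for name, (old, new) in scores.items()}

# Print from smallest to largest gain
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:34s} {ratio:.2f}x")
```

The bandwidth- and floating-point-heavy entries (STREAM, Linpack, SPECfp) sit at the top of the range, which is where the "up to 2X" headline comes from.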


Intel® Xeon® Processor E5-2600 Product Family
Generational Performance Summary


Intel® Xeon® Processor E5-2690 (8C, 2.9GHz, 135W) vs. Intel® Xeon® Processor X5690 (6C, 3.46GHz, 130W), Turbo Enabled
Higher is better
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Intel® Xeon® processor E5-2690 delivers performance gains up to 2X 




Linpack performance may vary based on thermal solution.
Configuration Details: Please reference foil 24 for details.
* Other names and brands may be claimed as the property of others


For more information go to http://www.intel.com/performance
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Cloud Computing Technology Trends 


* High Density Servers > "Sea of CPUs"
  - Smaller & power-efficient CPUs; beefier memory & IO subsystems
  - Distributed Fabric > networking & storage IO sharing & virtualization


* Server-on-Chip (SoC) Approach
  - Integrated NIC and IO chipset
  - CPU/GPU combination for HPC applications


* Active Power Management
  - Firmware based optimization based on user workload (power is measured through TDP)
  - Maximize performance while managing TDP


* Server Standardization
  - Service provider specified
  - ODM designed & manufactured
  - Open Source / non-commercial SW base: Open Stack, Open Compute
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Cloud Servers - Typical Form Factors 


Public Cloud 


Applications 


° Scale Out Services > Hosted Mail, 
Search, Social, Cloud Hosting 


Platforms 


° Dell PowerEdge C, HP ProLiant 
Microserver, DCS custom 


Typical Specifications 


* 1/2 Socket 2/4 core 2.8GHz, 80W
* 280 SpecintRate
* System Power <500W; Cost <$2K
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Building Out vs. Scaling Out 


TODAY: 2RU 


* 2 Nodes per Rack Unit
* 2 Sockets @ 95W each
* Shared Chassis, Power Supply & Cooling
* Google, FB, Amazon Custom Datacenters


* 8/12 Nodes in 3RU
* Single Socket @ 45W
* Shared Chassis, Power Supply, & Cooling
* Dell PowerEdge C5220, Supermicro MicroCloud


TOMORROW: 10RU


* 256/512 Nodes in 10RU
* Single Socket @ 10-20W
* Shared IO Resources
* Integrated ToR Switch
* SeaMicro SM10000, HP Redstone
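
The density trend in the three form factors above can be put into numbers. The sketch below takes the node counts and per-socket power from the slide and derives sockets and CPU power per rack unit; where the slide gives a range, the choice of the upper node count and lower power figure is an assumption, as is reading "2 Nodes per Rack Unit" as 4 nodes in a 2RU chassis.

```python
def density(nodes, rack_units, sockets_per_node, watts_per_socket):
    """Return (sockets per rack unit, CPU watts per rack unit)."""
    sockets_per_ru = nodes * sockets_per_node / rack_units
    return sockets_per_ru, sockets_per_ru * watts_per_socket


designs = [
    # (label, nodes, rack units, sockets per node, watts per socket)
    ("2RU, dual socket @ 95W", 4, 2, 2, 95),
    ("3RU, single socket @ 45W", 12, 3, 1, 45),
    ("10RU, single socket @ 10W", 512, 10, 1, 10),
]

results = {label: density(*params) for label, *params in designs}
for label, (sockets, watts) in results.items():
    print(f"{label:28s} {sockets:5.1f} sockets/RU  {watts:6.1f} CPU W/RU")
```

Socket count per rack unit rises by more than an order of magnitude in the 10RU design while CPU power per rack unit stays in the same range, which is the trade the slide is pointing at.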


Opportunities from Hardware 


Integration
= Cores + memory + networking + I/O
= Lower latency
= Better QoS
  - Multiple Priorities
  - B/W guarantees

Order Cores
= Break tradeoff between wimpy and brawny cores
= Energy efficiency at good performance (ARM-based processors are well suited here)

Support
= Improve utilization without hurting performance


Highly Integrated Server on Chip | Efficient | Low Latency Interconnect


Cloud Requirements > Integrated, Right-Sized Compute. Memory. Network.
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ARMv8 (Oban): Fully Backwards Compatible 
New 64b ISA + Current 32b ISA 


New: ARMv8s 


= 64b (General) and 128b (FP, SIMD) registers
= SP, PC no longer general purpose registers
= Uniform load/store addressing modes
= Larger data and instr. offset ranges
= Simplified load/store multiple instructions
= Reduced conditional instructions
= 32 128b FP/SIMD architecture registers
= No SIMD on general purpose registers
= New instructions for debug, TLB, barriers
= New Crypto acceleration instructions


* New High Performance 64bit ISA + compatibility with existing 32bit ISA 
* Full CPU, IO, Interrupt, Timer Virtualization 

* Enhanced 128b SIMD operations 

* High performance Floating-Point operations including FMADD 

* Standard Performance Monitoring, Instr. Trace and Debug Architecture 
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X-Gene'" CPU Design Goals 


= High-Performance Low-Power Microarchitecture
  - Design point targets balance between performance, power, and size
  - Maximum "bang for the buck"


= Low Power Microarchitecture Features
  - Sophisticated branch prediction, Caches, Unified register renaming
  - Minimal instruction replay cases
  - Separate smaller schedulers per pipe
  - Full set of power management features


= Good Single-Thread Performance, but also Efficiently Scalable to Many Cores
  - Scalable CPU and interconnect architecture 2-128 cores
  - High bandwidth, low latency switch fabric > 1Tbps
  - High-performance distributed hardware cache coherency


= Technology Portability
  - Fully synthesizable RTL
  - Semi-custom cell-based design methodology
  - Small targeted set of custom macros (plus clock distribution cells/macros)
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X-Gene™ Processor Module 


[Diagram: Processor Module with two ARM 64-bit CPUs, each with L1I and L1D caches, sharing an L2 Cache]


Processor Module
= 2 cores + shared L2 cache
= 4 wide out-of-order superscalar microarchitecture
= Integer, scalar, HP/SP/DP FPU and 128b SIMD engine
= Hardware virtualization support
= Hardware tablewalk and nested page tables
= Full set of static and dynamic power management features
= Fine grain/macro clock gating, DVFS
  * C0, C1, C3, C4, C6 states


Cache Hierarchy
= Separate L1I and L1D caches
= Shared L2 cache among 2 CPUs
= Last-level globally shared L3 Cache
  * Advanced hardware prefetch in L1 and L2
  * L2 inclusive of L1 write-thru data caches
= ECC and Parity protection of all Caches, Tags, TLBs
= Data poisoning and error isolation
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X-Gene™ CPU Block Diagram 


[Block diagram: Instr Cache (from L2 Cache) with TLB, BTB, Ret Stk, CBr Pred and Instr Buf; Instr Group (4x); Instr Dec / Xlat (4x); Rename (4x); Global Pipe Ctrl with BrChkptBuf and OpChkptBuf; Interrupt; Clk & Pwr Management; input buffers feeding the Simple Int Pipe, Complex Int Pipe, Ld Pipe, St Pipe and Floating Point / SIMD Pipe; Data Cache (to L2 Cache) with TLB, Data Buf and SDB / SFB]


X-Gene™ Instruction Fetch 


[Diagram: Instr Cache with TLB, BTB, Ret Stk and Instr Buf]


I-Cache & Fetch Unit
= Fetch multiple instructions/cycle
= Instruction pre-decode bits stored with each cache-line
= Single cycle scan to pick next predicted taken branch
= 2-level branch prediction
= Branch Target Buffer
= Conditional, call/return branch predictors
= History based indirect branch predictors
= First level fully-associative L1 TLB
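
The slide does not say how the 2-level branch prediction is organized. One textbook reading is a two-level adaptive predictor, where a global history register (level one) selects a 2-bit saturating counter in a pattern table (level two). The sketch below illustrates that general technique only; the table size, the XOR indexing and all names are assumptions and do not describe the X-Gene design.

```python
class TwoLevelPredictor:
    """Global-history two-level predictor with 2-bit saturating counters."""

    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.history = 0                              # level 1: recent outcomes
        self.counters = [1] * (1 << history_bits)     # level 2: start weakly not-taken

    def _index(self, pc):
        # Combine branch address bits with the history (gshare-style XOR)
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly, training as it goes."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(outcomes)


# A strictly alternating branch defeats a single 2-bit counter,
# but is learned once the history register has filled.
alternating = [i % 2 == 0 for i in range(200)]
print(accuracy(TwoLevelPredictor(), 0x400, alternating))
```

The history is what lets the predictor separate "last outcome was taken" from "last outcome was not taken" and keep a different counter for each context.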
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X-Gene™ Instructions Decoding/Grouping 


[Diagram: Instr Group (4x), Instr Dec / Xlat (4x)]


Decoding/Grouping
= Quad instruction grouping
= On the fly "CISC" instruction to RISC OP mapping
= Full renaming of registers
= Dispatch into execution schedulers
= No dispatch constraints on instruction grouping
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X-Gene™ Reorder Buffer, Dispatch and Control 


Pipeline Control 

= Branch checkpoint buffer 
= Re-order buffer 

= Unified register file 


Global Pipe Ctrl 


BrChkptBuf 
OpChkptBuf 
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X-Gene™ Integer, Branch, Load and Store Units 


Integer and Load/Store Units
= Separate branch pipe
= Out of order schedulers
= Two integer ops/cycle
= Fully pipelined execution units
= Separate load and store pipes
= Memory disambiguation


[Diagram: Branch pipe, Simple and Complex integer pipes, Load pipe, Store pipe]
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X-Gene™ FPU/SIMD 


Floating Point & SIMD Unit
= Separate FP/SIMD renamer
= Out of order scheduler
= Full frequency scalar FPU
= Full frequency int/FP SIMD unit
= Fully pipelined operations
= FP/SIMD Load and FP/SIMD Store and Reg Op per cycle


[Diagram: Sched, Floating Point / SIMD Pipe]
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X-Gene'" Data Cache 


Data Cache
= First level fully-associative TLB
= Write-through to L2 with write-combining
= Store to load forwarding
= Banked data arrays for performance and low power
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X-Gene'" Memory Management 


[Diagram: MMU with L2 TLB]


Memory Management Unit
= Set-associative second-level TLB
= Supports all architecture page sizes
= Nested page-table walker
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X-Gene™ — CPU Memory Subsystem 


= Modular architecture 
= Three level cache hierarchy 
= Globally shared L3 Cache 


High-performance Symmetric 
Multi-core Design 


= Runs at full CPU frequency 

= <15ns latency, ~200GB/s B/W 

= Over 400 transactions in flight 

= Central snoop controller and ordering point 

= Decoupled frequency and power domains 

= Support global cache and TLB inv operations 


[Diagram: Coherent Network connecting the processor modules to a scalable, modular L3 Cache; memory bridges to DDRs on either side; IO Bridge for SOC connectivity, to IO Network]
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X-Gene™ Server on Chip 


[Diagram: processor modules (PM) with L2 caches, Multi-Channel DDR3 Memory, Offloads, Networking I/F (10G I/O), Comms I/F (PCIe), Storage I/F (SATA)]


64bit ARM Server Class CPU > Multi-core for Distributed Computing 
Increased Memory Capacity and 10G I/O Integration 
Integrated Peripherals and L2 Switching 


Workload Specific Acceleration 
Available: 2H'12 
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Right Sizing + Connected On-Chip Fabric 


= Customizable Blade Design
  - Configurable swappable blades within 1 sled
  - Networking/Compute/Storage shared over common bed of CPUs > Saves Power & Cost


= Overall System Optimization
  - Integrated NIC and IO chipset
  - Load Balancing across multiple blades to Optimize System Balance
  - Shared Resources for System Management, Power and Cooling


= System Capabilities
  - 100s of Gbps of network bandwidth
  - 10s of Tbps of interconnect fabric bandwidth
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X-Gene ™ 
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Agenda


* Fujitsu Processor Development History


* SPARC64™ X
  - Design concept
  - SWoC (Software on Chip)
  - Processor chip overview
  - u-Architecture
  - Performance


* Summary
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Fujitsu Processor Development 
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SPARC64™ X Design Concept 


* Combine UNIX and HPC processor features to realize an extremely high throughput UNIX processor.

  - SPARC64 VII/VII+ (UNIX processor) feature
    o High CPU frequency (up to 3GHz)
    o Multicore/Multithread
    o Scalability: up to 64 sockets

  - SPARC64 VIIIfx (HPC processor) feature
    o HPC-ACE: Innovative ISA extensions to SPARC-V9
    o High Memory B/W: peak 64GB/s, Embedded Memory Controller


* Add new features vital to current and future UNIX servers
  - Virtual Machine Architecture
  - Software On Chip
  - Embedded IOC (PCI-GEN3 controller)
  - Direct CPU-CPU interconnect
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Software on Chip 1/2


* HW for SW
  Accelerates specific software function with HW


* The targets
  - Decimal operation (IEEE754 decimal and NUMBER)
  - Cypher operation (AES/DES)
  - Database acceleration


* HW implementation
  - The HW engines for SWoC are implemented in FPU
    o To fully utilize 128 FP registers & software pipelining
  - Implemented as instructions rather than dedicated co-processor to maximize flexibility of SW.
  - Avoid complication due to "CISC" type instructions
    o Various "RISC" type instructions are newly defined, instead.
    o 18 insts. for Decimal, and 10 insts. for Cypher operation
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Software on Chip 2/2 


Decimal Instructions


* Supported data type
  - IEEE754 DPD (Densely Packed Decimal): 8B fixed length
  - NUMBER: Variable length (max 21 bytes)


* Instructions
  - Both DPD/NUMBER instructions are defined as 8B operations (add/sub/mul/div/cmp) on FP registers
    o To maximize performance with reasonable HW cost
    o When the data length is > 8 bytes, multiple such instructions will be used.
  - An instruction for special byte-shift on FP registers is newly added to support unaligned NUMBER
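
The point that longer decimal values are handled by several 8-byte operations can be illustrated in software. The sketch below adds two decimal numbers one fixed-size word at a time, carrying between words. It assumes 16 decimal digits per 8-byte word (the precision of IEEE 754 decimal64) and works on plain digit strings; it does not model the DPD bit encoding, the NUMBER format, or the actual SPARC64 X instructions.

```python
DIGITS_PER_WORD = 16  # assumption: one 8-byte decimal word holds 16 digits


def split_words(digits):
    """Split a decimal digit string into 16-digit words, least significant first."""
    words = []
    while digits:
        words.append(int(digits[-DIGITS_PER_WORD:]))
        digits = digits[:-DIGITS_PER_WORD]
    return words or [0]


def add_decimal(a, b):
    """Add two non-negative decimal strings one word at a time with carry."""
    wa, wb = split_words(a), split_words(b)
    base = 10 ** DIGITS_PER_WORD
    carry = 0
    out = []
    for i in range(max(len(wa), len(wb))):
        x = wa[i] if i < len(wa) else 0
        y = wb[i] if i < len(wb) else 0
        total = x + y + carry          # one "8B add" plus the incoming carry
        out.append(total % base)
        carry = total // base
    if carry:
        out.append(carry)
    head = str(out[-1])
    tail = "".join(f"{w:0{DIGITS_PER_WORD}d}" for w in reversed(out[:-1]))
    return head + tail


# A 20-digit operand needs two words; the carry ripples across the boundary.
print(add_decimal("99999999999999999999", "1"))
```

Keeping each operation at a fixed 8-byte width is what allows the hardware cost to stay bounded while software composes wider arithmetic from it.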
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SPARC64"™ X Chip Overview 


* Architecture Features
  - 16 cores x 2 threads
  - SWoC (Software on Chip)
  - Shared 24 MB L2$
  - Embedded Memory and IO Controller


  - 28nm CMOS
  - 23.5mm x 25.0mm
  - 2,950M transistors
  - 1,500 signal pins
  - 3GHz


* Performance (peak)
  - 288GIPS/382GFlops
  - 102GB/s memory throughput
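
The peak figures can be related to the core resources with a quick consistency check. The sketch below assumes each of the four FMA units listed in the core spec contributes two floating-point operations per cycle; that interpretation, and the variable names, are assumptions rather than statements from the slides.

```python
cores = 16
clock_ghz = 3.0
fma_units = 4        # "FMA x4" in the core spec
flops_per_fma = 2    # one multiply plus one add per fused operation

# Peak from the nominal clock and unit count
nominal_gflops = cores * clock_ghz * fma_units * flops_per_fma

# Per-core, per-cycle rates implied by the quoted peaks
implied_flops_per_cycle = 382 / (cores * clock_ghz)
implied_instr_per_cycle = 288 / (cores * clock_ghz)

print(f"nominal peak: {nominal_gflops:.0f} GFlops")
print(f"implied flops/cycle/core: {implied_flops_per_cycle:.2f}")
print(f"implied instructions/cycle/core: {implied_instr_per_cycle:.1f}")
```

The nominal product is slightly above the quoted 382 GFlops, which would be consistent with an exact clock a little under 3 GHz, though the slides do not say so.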


SPARC64™ X    All Rights Reserved, Copyright© FUJITSU LIMITED 2012


SPARC64™ X Core spec 


Instruction Set Architecture | SPARC-V9/JPS, HPC-ACE, VM, SWoC
Branch                       | 4K BRHIS
Integer Execution Units      | 156 GPR x 2 + 64 GUB; ALU/SHIFT x2; ALU/AGEN x2; MULT/DIVIDE x1
FP Execution Units           | 128 FPR x 2 + 64 FUB; FMA x4, FDIV x2; IMA/Logic x4; Decimal x1 / Cypher x2
L1I$                         | 64KB/4way
L1D$                         | 64KB/4way
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u-Architecture enhancements 


from SPARC64™ VII+

* CPU Core
  - Deeper pipeline to increase frequency
  - Better branch prediction scheme
  - Various queue sizes and the number of floating point registers increased
  - Richer execution units, including
    o 2EX + 2EAG -> 2EX + 2EX/EAG
    o 2FMA -> 4FMA to support 2way-SIMD
    o SWoC engine (Decimal and Cypher)
  - More aggressive O-O-O execution of load and store
  - Multi-banked 2port L1-Cache


* System On Chip
  - #core and L2$ size (4core/12MB -> 16core/24MB)
  - Memory Controller, IO Controller, and CPU-CPU I/F are all embedded to increase performance and reduce cost.
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SPARC64"™ VII/VII Pipeline 


Pipeline stages: Fetch (4 stages), Issue, Dispatch, Reg.-Read, Execute, Memory (L1$: 3 stages), Commit
[Figure: pipeline diagram; the remaining per-stage depths are not legible in this copy]
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SPARC64"™ X Pipeline 


Pipeline stages: Fetch (4 stages), Issue, Dispatch, Reg.-Read, Execute, Memory (L1$: 3 stages), Commit (2 stages)
[Figure: pipeline diagram; the remaining per-stage depths are not legible in this copy]


[Diagram: 16 cores with Memory Controller (DIMM), CPU I/F and PCI-GEN3]
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SPARC64™ X 


Execution units enhancements (Ex.)


* Integer Execution Unit
  - 2EX + 2EAG -> 2EX + 2EX/EAG
  - 2 -> 4W GPR
  -> 4 integer instructions can be executed per cycle (sustained)


* Load Store Unit
  - Aggressive load/store O-O-O execution: execute load without waiting for preceding store address calculation.
  - Multi-banked 2port L1-cache to execute 2 loads or 1 load + 1 store in parallel
  - Doubled L1$ bus size
  - Doubled L1$ associativity (2 -> 4way)
  -> Increase L1-cache throughput and hit-rate


[Diagram: EX/EAG units with GPR update and commit paths; L1 cache with 16B load x2 and 2R/1W ports]
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SPARC64™ X interconnects


* SPARC64™ VII/VII+ interconnects (SPARC Enterprise M8000)
  - 4 CPUs require 8 additional LSIs to be connected with DIMM
  - DIMM i/f: 4.35GB/s (STREAM triad)


* SPARC64™ X interconnects
  - No additional LSIs to be connected with DIMM
  - DIMM i/f: 65.6GB/s (STREAM triad)
  - CPU i/f: 14.5GB/s x 5 ports (peak)
    o 3 ports: glueless 4way CPU interconnect
    o 2 ports: > 4way CPU
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High Speed Transceivers (SerDes) 


* CPU-CPU glue-less communication links
  - 14.5Gb/s x 8 lanes bi-directional serial interface, 5 ports
  - Embedded equalizer circuit enables long distance signal transmission
  - Embedded adaptive control logic optimizes equalizer parameters automatically depending on the various system configurations


* PCI Express ports
  - 8Gb/s x 8 lanes (Gen 3), 2 ports


[Figure: 14.5Gb/s x 8 lanes SerDes]


* Built-in SerDes provides peak 88.5GB/s x2 (up/down) total throughput
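
The 88.5GB/s total can be reproduced from the per-lane rates and port counts given on this slide. The sketch below uses raw line rates and ignores encoding overhead, which is an assumption; the helper name is illustrative.

```python
def port_gbytes(gbits_per_lane, lanes):
    """Raw one-direction bandwidth of a port in GB/s."""
    return gbits_per_lane * lanes / 8


cpu_links = 5 * port_gbytes(14.5, 8)    # five CPU-CPU ports
pcie_links = 2 * port_gbytes(8.0, 8)    # two PCI Express Gen 3 x8 ports
total = cpu_links + pcie_links

print(f"CPU-CPU: {cpu_links} GB/s, PCIe: {pcie_links} GB/s, total: {total} GB/s")
```

Each CPU-CPU port works out to 14.5 GB/s per direction, matching the per-port figure quoted for the CPU interface.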
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Reliability, Availability, Serviceability 


[Figure: SPARC64™ X RAS diagram. Green: 1bit error Correctable; Yellow: 1bit error Detectable; Gray: 1bit error harmless]


Error detection and correction scheme
  Cache (Tag)                | ECC; Duplicate & Parity
  Cache (Data)               | ECC; Parity
  Register                   | ECC (INT/FP); Parity (Others)
  ALU                        | Parity/Residue
Cache dynamic degradation    | Yes
HW Instruction Retry         | Yes
History                      | Yes


* New RAS features from SPARC64™ VII/VII+
  - Floating-point registers are ECC protected
  - #checkers increased to ~53,000 to identify a failure point more precisely
  -> Guarantees Data Integrity
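
The "Parity/Residue" entry for the ALU refers to a family of checks that can be sketched in a few lines. A residue check recomputes the operation on small remainders and compares them with the remainder of the full result. The modulus of 3 and the function names below are assumptions chosen for illustration; the slides do not describe the actual checker design.

```python
MODULUS = 3  # assumption: a small modulus used only for illustration


def residue(x):
    return x % MODULUS


def checked_add(a, b, alu_result):
    """True if alu_result is consistent with a + b under the residue check."""
    return residue(alu_result) == (residue(a) + residue(b)) % MODULUS


def checked_mul(a, b, alu_result):
    """True if alu_result is consistent with a * b under the residue check."""
    return residue(alu_result) == (residue(a) * residue(b)) % MODULUS


def parity(x):
    """Even/odd parity bit of a non-negative integer."""
    return bin(x).count("1") & 1


good = 1234 + 5678
flipped = good ^ (1 << 4)   # simulate a single-bit upset in the result

print(checked_add(1234, 5678, good))      # consistent
print(checked_add(1234, 5678, flipped))   # inconsistent: error detected
```

A single flipped bit changes the result by a power of two, which is never a multiple of 3, so a mod-3 residue check always catches it.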
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Hardware Instruction Retry 


4. Update of SW 
visible resources 


SW visible SW visible 
resources L resources | 


5. Back to normal execution after the re-executed 
Instruction gets committed without an error. 


* When an error is detected, the hardware automatically re-executes the instruction to remove the transient error by itself.
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Hardware measured results 


Relative to SPARC64™ VII+ @ 2.86GHz


[Figure: relative performance bars in two groups, Single Thread and Throughput, including AES-256-CBC]


SPARC64™ X realizes 7x INT/FP/JVM throughput and 15x memory throughput of SPARC64™ VII+.

The INT/FP/JVM result is with un-tuned Compiler/JVM.
SWoC of SPARC64™ X results in max 98x throughput.

The NUMBER score is for scalar data. It is expected to be much better for vector data.
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SPARC64™ VII+ v.s. SPARC64™ X 


INT (single thread), hardware measured results


[Figure: stacked bars per benchmark comparing SPARC64™ VII+ and SPARC64™ X, on an axis running from lower to higher performance, for perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum, h264ref, omnetpp, astar, xalancbmk and GEOMEAN. Categories: 4 Commit, 2-3 Commit, 1 Commit, 0 Commit due to Core, 0 Commit due to Cache, 0 Commit due to Mem]


4 integer execution units and the write port increase of GPR (integer register) improve overall performance.


Memory latency reduction, large L2$, branch prediction, and L1$ improvement also contribute dramatically to the high performance.
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Summary


* SPARC64™ X is Fujitsu's 10th SPARC processor, designed for Fujitsu's next generation UNIX server.


* SPARC64™ X integrates 16 cores + 24MB L2 cache with over 100GB/s (peak) memory B/W.


* SPARC64™ X keeps strong RAS features.


* SPARC64™ X chip is up and running in the lab.


* It has shown 7 times the throughput of SPARC64™ VII+ w/o compiler tuning.


* SWoC is effective in accelerating specific software functions.


* Fujitsu will continue to develop the SPARC64™ series.
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Abbreviations 
. SPARC64™ X 


— IB: Instruction Buffer 

— RSA: Reservation Station for Address generation 
— RSE: Reservation Station for Execution 

— RSF: Reservation Station for Floating-point 

— RSBR: Reservation Station for Branch 

— GUB: General Update Buffer 

— FUB: Floating point Update Buffer 

— GPR: General Purpose Register 

— FPR: Floating Point Register 

— CSE: Commit Stack Entry 
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SPARC T5: 16-core CMT Processor with
Glueless 1-Hop Scaling to 8-Sockets


Sebastian Turullols and Ram Sivaramakrishnan 
Hardware Directors, Microelectronics 




The following is intended to outline our general 
product direction. It is intended for information 
purposes only, and may not be incorporated into 
any contract. It is not a commitment to deliver any 
material, code, or functionality, and should not be 
relied upon in making purchasing decisions. The 
development, release, and timing of any features or 
functionality described for Oracle’s products 
remains at the sole discretion of Oracle. 
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Outline


* Design Objectives
* SPARC T5 Processor Overview
* Core S3
* Cache Hierarchy Components
* Internode Coherency for 8-Socket Scaling
* Power Management Advances
* PCI-Express Gen3 I/O Subsystem
* Summary


SPARC T5 Design Objectives


* Multiply performance
* Achieve highly efficient 8-socket glueless 1-hop scalability
* Optimize for Oracle workloads and Engineered Systems
* Maximize power efficiency
* Provide Enterprise Class RAS
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Т5 FAN sail 


16 S3 cores @ 3.6GHz
8MB shared L3 Cache
8 DDR3 BL8 Schedulers providing 80 GB/s BW
8-way 1-hop glueless scalability
Integrated 2x8 PCIe Gen 3
Advanced Power Management with DVFS


S3 Core Recap


* 28nm port from 40nm T4
* Out-of-order, dual-issue
* High frequency: 3.6GHz achieved with 16 stage integer pipe
* Dynamically threaded, one to eight strands
* Accelerates 16 encryption algorithms and random number generation
SPARC T5 Leads in On-Chip Encryption Acceleration

* Built-in, zero-overhead crypto
* Works with Solaris ZFS file system for faster file system encryption
* Provides [...] consolidation with dynamic VM migration

| On-Chip Accelerators | SPARC T5 | IBM Power7 | Intel Westmere/Sandybridge |
| Asymmetric / Public Key Encryption | [...] DSA, RSA, ECC | [...] | [...] |
| Symmetric Key / Bulk Encryption | AES, DES, 3DES, Camellia, Kasumi | none | AES |
| Message Digest / Hash Functions | CRC32c, MD5, [...] | [...] | [...] |
| [...] | Supported | none | Supported |
Core Caches

* 16 KB 4-way set associative L1 Instruction cache
* 16 KB 4-way set associative write through L1 Data cache
* 128 KB 8-way set associative, unified, inclusive L2 cache
L2-L3 Interconnect

* 8x9 Crossbar Switch connects the 16 cores to
  — 8 address-interleaved L3 banks and
  — an I/O bridge
* The L3-L2 direction contains a control and data network.
  — Control network provides a heads up for dependent instruction wake-up
  — Data network is used to return line fill data and send L3-L2 snoops
* Crossbar network has a bisection BW of 1 TBps, 2x T4
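
As a rough illustration of address interleaving across the eight L3 banks, the sketch below assumes (the slides do not say this) that consecutive 64-byte cache lines rotate through the banks using the low-order bits of the line address. The function name `l3_bank` and the choice of bits are hypothetical; the 64-byte line size is the one quoted in the L3 Cache Overview.

```python
LINE_BYTES = 64   # cache line size quoted in the L3 Cache Overview
NUM_BANKS = 8     # address-interleaved L3 banks

def l3_bank(phys_addr: int) -> int:
    """Hypothetical bank select: low-order bits of the line address."""
    return (phys_addr // LINE_BYTES) % NUM_BANKS

# Consecutive lines land in consecutive banks and wrap after eight.
banks = [l3_bank(line * LINE_BYTES) for line in range(10)]
print(banks)
```

With this kind of mapping, a streaming access pattern spreads evenly over all banks, which is the usual motivation for interleaving at line granularity.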
L3 Cache Overview

* 8MB (1MB per bank) 16-way set associative inclusive cache
* MOESI Coherence States per 64B cache line
* Tag results are used to
  — qualify data array access
  — align dependent issue with data return or conditionally flush thread
* Precise tag "reverse" directory keeps L2s coherent

[Figure: L3 bank block diagram. Recoverable labels: L2 request, DMA or coherence requests, L3 hit, L3 miss request, line fill data, DMA write data, writeback data.]
L3 Cache Overview (continued)

* Speeds up IO by allocating DMA buffers in the cache
  — Enhances clustered application performance
* Acceleration of contended locks
  — L3 forms a chain of same address requests
  — Processes them atomically on receiving an exclusive copy
* Supports coherent flushing and retirement of cache lines to avoid persistent errors
Internode Coherency Overview

* Glueless 1-hop scaling to eight sockets
* A precise directory tracks all L3s in the system
  — striped across all processors
  — stored in on-chip SRAMs
  — flexible for different socket counts
* Higher BW efficiency than snoop-based protocols enables better scaling
  — 50% more effective bandwidth than comparable snoopy implementation

Internode Coherency Fabric

* Each link is 14 lanes wide and runs up to 15Gbps per lane
* Directly connected links minimize latency
* Trunked links achieve more bandwidth in smaller configurations
* Supports single lane failover


Internode Performance Optimizations

* Speculative memory reads prior to cache line serialization in the directory
* Cache-to-cache line transfers between nodes
* Dynamic congestion avoidance routes inter-node data around congested links
Internode Transaction Flow

1. A Requester issues a NodeRequest to the Directory and a SpeculativeRead to the Memory Home
2. After a Directory lookup, either a HomeRead or a SourceData request is generated
3. Data is returned from the Memory Home or C2C Slave
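
The three steps above can be sketched as a small simulation. This is only an illustrative model: the `Directory` layout, the `internode_read` helper, and the idea that a HomeRead simply confirms the speculatively read data are assumptions made for the example, not details taken from the slides.

```python
from dataclasses import dataclass, field

@dataclass
class Directory:
    # line address -> node id holding the line in its L3 (hypothetical layout)
    owner: dict = field(default_factory=dict)

def internode_read(addr, requester, home, directory, memory, l3_caches):
    """Return (data, trace) for one read, following the three slide steps."""
    trace = [f"NodeRequest node{requester}->directory",
             f"SpeculativeRead node{requester}->home{home}"]
    speculative = memory.get(addr)            # step 1: memory read starts early
    owner = directory.owner.get(addr)         # step 2: directory lookup
    if owner is not None and owner != requester:
        trace.append(f"SourceData directory->node{owner}")
        data = l3_caches[owner][addr]         # step 3: cache-to-cache transfer
        trace.append(f"Data node{owner}->node{requester} (C2C)")
    else:
        trace.append(f"HomeRead directory->home{home}")
        data = speculative                    # step 3: memory data is used
        trace.append(f"Data home{home}->node{requester}")
    directory.owner[addr] = requester         # requester now tracked as holder
    return data, trace

memory = {0x100: "stale", 0x200: "mem"}
l3 = {2: {0x100: "dirty"}}
d = Directory(owner={0x100: 2})
hit_data, hit_trace = internode_read(0x100, 0, 1, d, memory, l3)
miss_data, miss_trace = internode_read(0x200, 0, 1, d, memory, l3)
```

The point of issuing the speculative read in step 1 is that memory latency overlaps with the directory lookup; when another node owns the line, the speculative data is simply discarded.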
Dynamic Congestion Avoidance

[Figure: data path from the Sourcing Node to the Requesting Node, either direct or forwarded through an Intermediate Node. Recoverable queue labels: DdataQ, FwdataQ.]
T5-8 Bandwidth

* DDR3-1066: 1+ TB/sec
* Coherency Bisection Bandwidth: 840 GB/sec
* PCI Gen3 Bandwidth: 256 GB/sec
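
Two of the quoted figures can be cross-checked with simple arithmetic. The sketch below assumes the 256 GB/sec PCIe number is the sum over eight sockets of the 32 GB/s per-socket peak given on the PCIe subsystem slide, and computes the raw per-link, per-direction rate from the lane count and lane speed on the fabric slide. Encoding overhead and the number of links crossing the bisection are not stated in the slides, so the 840 GB/sec figure is not derived here.

```python
SOCKETS = 8
PCIE_GBYTES_PER_SOCKET = 32    # GB/s, dual x8 Gen 3 ports (PCIe subsystem slide)
LANES_PER_LINK = 14            # coherency link width (fabric slide)
GBITS_PER_LANE = 15            # Gbps, upper bound quoted per lane

pcie_total = SOCKETS * PCIE_GBYTES_PER_SOCKET        # GB/s across the system
link_raw_gbits = LANES_PER_LINK * GBITS_PER_LANE     # Gb/s, one direction
link_raw_gbytes = link_raw_gbits / 8                 # GB/s, before any encoding overhead

print(pcie_total, link_raw_gbits, link_raw_gbytes)
```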
Multiprocessor Performance
T5 8-Socket Scaling


Power Management Advances

* Hardware saves power below 100% utilization with:
  — Chip wide DVFS
  — Per core pair cycle skipping
  — SerDes power scaling
  — DIMM off-lining w/ Dynamic Reconfiguration
  — DRAM PPSE and PPFE support
  — PCI Express Power Management
  — Clock Gating
* When peak performance is demanded
  — Power Management Controller achieves maximum frequency within customer imposed power and thermal limits

[Screenshot: Oracle Integrated Lights Out Manager, Power Management Settings page]
Power Management Controller: Elastic Savings

* Hardware saves power below 100% utilization
  — Chip wide DVFS
  — Per core pair cycle skipping
* Software monitors frequency needs of all cores
  — Puts chip at DVFS point satisfying all cores' requirements
  — Puts core pairs at lowest cycle skip ratio satisfying both cores in the pair

[Chart: Power vs Frequency with DVFS]
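
A minimal sketch of the two-level policy described above, under stated assumptions: the DVFS table, the eight-step cycle-skip granularity (`skip_steps`), and the reading of "lowest cycle skip ratio" as the smallest fraction of cycles a pair must run to meet its demand are all hypothetical choices made for illustration.

```python
import math

def choose_operating_points(core_demand_ghz, dvfs_points_ghz, skip_steps=8):
    """core_demand_ghz: per-core frequency needs; cores 2k and 2k+1 form a pair.
    Returns (chip_freq, per-pair fraction of cycles actually run)."""
    need = max(core_demand_ghz)
    # Lowest DVFS point that still satisfies the most demanding core.
    chip = min((f for f in dvfs_points_ghz if f >= need),
               default=max(dvfs_points_ghz))
    fractions = []
    for i in range(0, len(core_demand_ghz), 2):
        pair_need = max(core_demand_ghz[i:i + 2])
        # Smallest run fraction k/skip_steps whose effective frequency covers the pair.
        k = max(1, math.ceil(pair_need / chip * skip_steps))
        fractions.append(min(k, skip_steps) / skip_steps)
    return chip, fractions

chip, fractions = choose_operating_points([3.0, 1.0, 0.5, 0.4],
                                          [2.0, 2.8, 3.2, 3.6])
print(chip, fractions)
```

In this example the busiest core forces the whole chip to the 3.2 GHz point, while the lightly loaded second pair can skip most of its cycles.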
Coherency Link Power Savings

* Link scaling (4, 3, 2, 1 dynamically as needed)
  — Hardware monitors link utilization
  — Software sets entry/exit policy (thresholds and dwell times)

[Figure: T5 sockets connected by trunks of 4, 3, 2, and 1 active links]
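
One way to picture a threshold-and-dwell policy is a small hysteresis controller. The class below is a hypothetical sketch: the threshold values, the dwell count, and the rule of changing one link at a time are assumptions, since the slides only say that hardware monitors utilization and software sets thresholds and dwell times.

```python
class LinkScaler:
    """Hypothetical policy: utilization is measured relative to active links."""

    def __init__(self, up_threshold=0.8, down_threshold=0.3, dwell=3, max_links=4):
        self.up, self.down, self.dwell = up_threshold, down_threshold, dwell
        self.max_links = max_links
        self.links = max_links
        self.streak = 0   # consecutive samples past a threshold (signed)

    def sample(self, utilization):
        """Feed one utilization sample (0..1); return the active link count."""
        if utilization > self.up and self.links < self.max_links:
            self.streak = self.streak + 1 if self.streak > 0 else 1
        elif utilization < self.down and self.links > 1:
            self.streak = self.streak - 1 if self.streak < 0 else -1
        else:
            self.streak = 0
        if self.streak >= self.dwell:
            self.links, self.streak = self.links + 1, 0
        elif self.streak <= -self.dwell:
            self.links, self.streak = self.links - 1, 0
        return self.links

scaler = LinkScaler()
idle = [scaler.sample(0.1) for _ in range(10)]   # sustained low utilization
busy = [scaler.sample(0.9) for _ in range(3)]    # sustained high utilization
```

The dwell requirement keeps a single quiet or busy sample from toggling links, which matters because bringing a link back up takes time.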
Memory Link Power Savings

* Two levels of memory link standby
  — L0s: Power savings with fast wake up
    * Light sleep for M frames, then wake up and listen for data
  — L1: Much more power savings with longer wake up
    * Completely power off both tx and rx except for PLL
    * Used for unallocated memory regions
Peak Performance Thermal Management

* 4 thermal diodes per chip
  — centered in core quads
* If any T > high-water mark
  — Drop Freq, V
* If all T < low-water mark
  — Raise Freq, V

[Figure labels: Temp Sensor, alarms]
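
The high-water and low-water rule for the thermal diodes amounts to a simple hysteresis step. The sketch below is illustrative only: the function name, the one-step-per-decision behavior, and the example temperatures are assumptions, not values from the slides.

```python
def next_dvfs_index(index, temps_c, high_c, low_c, num_points):
    """One control step over an ascending table of (F, V) points."""
    if any(t > high_c for t in temps_c):
        return max(index - 1, 0)                # any diode too hot: drop F, V
    if all(t < low_c for t in temps_c):
        return min(index + 1, num_points - 1)   # every diode cool: raise F, V
    return index                                # between the marks: hold

# Four diode readings, one per core quad (example values only).
hot = next_dvfs_index(2, [70, 91, 60, 65], high_c=90, low_c=75, num_points=4)
cool = next_dvfs_index(2, [70, 60, 60, 65], high_c=90, low_c=75, num_points=4)
hold = next_dvfs_index(2, [80, 60, 60, 65], high_c=90, low_c=75, num_points=4)
```

The gap between the two marks keeps the controller from oscillating when temperatures hover near a single limit.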
Peak Performance Current Management

* Drop F,V if any current > high-water mark
* Raise F,V if any current < low-water mark
* Controls currents for CPU VDD plus motherboard and DIMMs

[Figure labels: throttle, resume]
T5 PCIe Subsystem

* Dual x8 PCI Express Gen 3 ports provide 32 GB/s peak b/w
* Supports Atomic Fetch-and-Add, Unconditional-Swap and Compare-and-Swap operations
* Accelerates virtualized I/O with Oracle Solaris VMs
  — 128k virtual function address spaces ensure direct SR-IOV access for all logical domains
  — 64-bit DVMA space reduces IO mapping overhead, improving network performance
  — Guarantees fault and performance isolation among guest OS instances
* Supports PCI Express Power Management
T5 PCIe Progression

| Feature | T4 | T5 |
| PCI Express revision | Gen 2 (dual x8 ports) | Gen 3 (dual x8 ports) |
| Data Management Unit | Single shared unit for both x8 PCIe ports | Two independent units, one for each x8 PCIe port |
| TLP Processing Hints | No | Yes, directs data to L3 cache |
| PCIe 2.0 compliance (ECN "Internal Error Reporting") | Signaled via MSI interrupt | Signaled via PCIe message |


SPARC T5 Summary

* Processor provides
  — Leadership throughput and per-thread performance
  — The industry's best on-chip encryption acceleration
  — Advanced power management
  — Highly-efficient one hop glueless scalability to 8 sockets
  — Enterprise-class general purpose computing and RAS
* SPARC T5 is the world's best processor for running Oracle software
  — Oracle Database, Fusion Applications, Fusion Middleware
Design Objectives Achieved

Optimize Systems
* Oracle workloads
* Engineered Systems
* Extends
  — on-chip crypto acceleration
  — RAS

Scale Efficiently
* Scales to 8 sockets using directory
* Minimizes latency
* Avoids congestion
* Maximizes bandwidth

Multiply Performance
* Double cores and cache
* Balance single thread and throughput
* Dynamically thread

Advance Power Management
* Maximizes peak performance
* Manages thermal and current loads
* Scales elastically
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Hardware and Software 


Engineered to Work Together 
Smarter Systems for a
Smarter Planet


IBM zNext — 
The 3rd Generation High Frequency Microprocessor Chip 


Chung-Lung (Kevin) Shum 


Senior Technical Staff Member, System z Processor Development, Systems & Technology Group, IBM Corp. 


© 2012 IBM Corporation 
zNext PU Chip Overview

= IBM mainframe microprocessor chip for the next generation of System z servers
= 32nm SOI technology
  — 597 mm² (23.7mm x 25.2mm)
  — 15 layers of metal, 7.68 miles of wire
  — 2.75 Billion transistors
  — I/Os: 10000+ Power, 1071 Signal
    SMP connections to external Hub chip (SC)
    I/O Bus Controller (GX)
    Memory Controller (MC) with prefetching
= Chip Features (vs. z196)
  — 6 new cores per chip (vs. 4)
  — Core-Dedicated (vs. shared) Co-Processors
  — 48 MB EDRAM on-chip shared L3 (2x)
= Processor Core Features
  — 2nd Generation out-of-order design
  — Speed & feed improvements
  — Microarchitecture innovations
  — Architecture extensions for software exploitations, e.g.,
    Hardware Transactional Memory
    Runtime instrumentation
Speed: Higher Operating Frequency

[Chart: operating frequency (GHz) and millions of transistors on the processor chip, by generation]

Year  System  Frequency
2000  z900    770 MHz
2003  z990    1.2 GHz
2005  z9      1.7 GHz
2008  z10     4.4 GHz
2010  z196    5.2 GHz
2012  zNext   5.5 GHz

= z900 — Full 64-bit z/Architecture
= z990 — Superscalar CISC pipeline
= z9 — System level scaling
= z10 — Deep Pipeline, Arch. extensions
= z196 — Out-Of-Order (OOO), Additional Architectural Extensions
= zNext — OOO+, Architectural Extensions, Enablement for new Software Paradigms

© 2012 IBM Corporation
Speed: Higher Operating Frequency (continued)

= Technology + Design Optimizations
  — Provides higher performance & capacity
  — Maintains unmatched reliability
  — Supports Peak workload 24x7
  — At similar power constraints
Feed Improvements: Maximizing Out-of-Order Window

= Improved dispatch grouping efficiencies
  > More instructions per group
  — Reduced cracked instructions overhead
  — Increased branches per group to 2
  — Added Instruction Queue (InsnQ) for re-clumping*

*clumps — parcel of instructions delivered from instruction fetching

[Figure: Clump of instructions (up to 3) → micro-ops issue (up to 5)]
Feed Improvements: Maximizing Out-of-Order Window

= Increased out-of-order resources
  > More out-of-order groups
  — Multi-grouped instructions speculative completion
  — Increased Global Completion Table (GCT) entries to 30x3 (+25%)
  — Increased usable physical GR entries to 80 (+25%)
  — Increased physical FR entries to 64 (+33%)

[Figure: Clump of instructions (up to 3) → OOO+ Dispatch Group (up to 3 micro-ops) → micro-ops issue (up to 5)]
Feed Improvements: Maximizing Out-of-Order Window

= Increased execution bandwidth
  > More instructions issued per cycle
  — Added Virtual Branch Queue (VBQ) for relative branch queuing
  — Added Virtual Branch Unit (VBU) for relative branch execution
  — Increased effective issue queue size to 32x2 (+60%)
  — Increased issue bandwidth per cycle to 7 (+40%)

[Figure: Clump of instructions (up to 3) → OOO+ Dispatch Group (up to 3 micro-ops) → issue (up to 7)]
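
The percentage gains quoted on the Feed Improvements slides imply the size of each structure in the previous design. The short calculation below works backwards from the new value and the stated increase; the helper name `implied_baseline` is made up for this example, and the results are only what the slide numbers imply, not figures stated in the deck.

```python
def implied_baseline(new_value, pct_increase):
    """Previous value implied by 'new_value (+pct_increase%)'."""
    return new_value / (1 + pct_increase / 100)

claims = {
    "GCT entries (30x3)": (30 * 3, 25),
    "usable physical GRs": (80, 25),
    "physical FRs": (64, 33),
    "issue queue entries (32x2)": (32 * 2, 60),
    "issue bandwidth per cycle": (7, 40),
}
for name, (new, pct) in claims.items():
    print(f"{name}: {new} now, about {implied_baseline(new, pct):.1f} before")
```

The implied issue bandwidth of 5 per cycle agrees with the "micro-ops issue (up to 5)" label on the earlier slides in this group.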
z196 Pipeline (recap)

[Figure: z196 pipeline. Recoverable labels: Instr (clump), decode, dep mtrx, regfile read, out of order execution, LSU (2), checkpoint.]

Diagram based on Brian Curran's HOTCHIP 22 presentation © 2012 IBM Corporation
zNext Pipeline

[Figure: zNext pipeline with new stages marked. Recoverable labels: Instr (clump), decode, Queue, group, map, dep mtrx, wake, age, issue, regfile read, out of order execution, RISC execution units, VBU (2), LSU (2), BFU, agen, format, write back, checkpoint.]
Feed Improvements: Accelerating Specific Functions

= Short-circuit executions
  — Common idioms executed during dispatch
  — e.g. initializing a GR with zeros
= D-Cache (SRAM design) with banking support
  — 32 banks for concurrent 2 read and 1 write operations
  — Faster cache writes reduce future load-use delays
= Dedicated Fixed-point divide engine resulting in 25-65% faster operations
= Millicode (Vertical Microcode) operations
  — Selective hardware execution
    Translate, Translate and Test, Store Clock
  — Shorter startup latency
    Move Character variations, Co-Processor operations
  — Hardware assists for prefetching (target cache level & coherency state)
    Move Character Long variations
  — Dedicated hardware for Unicode conversion (UTF8 <> UTF16)


Micro-Architecture Innovations: Branch Prediction

= Branch Prediction is essential in improving performance
  — 2nd level BTB (BTB2) for capacity (more than 3x)
  — Fast re-Indexing Table (FIT) for latency (up to 33% reduction)

[Figure: Branch Prediction Logic. Recoverable labels: new branches, predictions, completion, BPOQ, CTB, PHT, Speculative BHT & PHT (3+2 entries), L2 BHT (32k entries).]

| Name | Expansion | Description |
| BTBP | Branch Target Pre-buffer | 0.5th level Branch IA and target predictor; look-up in parallel to BTB1, upon usage, transfer to BTB1 |
| BHT | Branch History Table | Direction predictor (2-bit) |

Diagram & Table based on Eric Schwarz's 2011 VAIL presentation © 2012 IBM Corporation
Micro-Architecture Innovations: Cache Subsystem

= Split Level 2 Cache (instead of unified)
  — 1M-byte Instruction, 1M-byte Data
  — Inclusive of instruction-L1 (64 Kbyte) and data-L1 (96 Kbyte)
  — Bigger aggregate L2 with shorter latency
= Integrated data-L2 directory
  — data-L2 directory is merged into data-L1 directory
  — Logically indexed like in data-L1 directory
  — L2 Hit / miss knowledge at L1 miss time
  > L1 miss, L2 hit latency reduced by up to 45%
= Store “Gathering” Cache
  — circular queue of 64 entries of half-lines (128 bytes)
  — merges stores to same half-line post L1 updates
  — reduces pipeline usage for stores in L2 and L3
  — Hardware Transactions storage updates
  > Store traffic to L3 typically reduced by ~50%

[Chart: Store Cache Rate Reduction (# of DW)]

Modeling Data provided by Jim Mitchell @ IBM Poughkeepsie
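
To make the store-gathering idea concrete, here is a small software model. It is a sketch under assumptions: the oldest-first drain on overflow and the byte-granular merge are choices made for illustration, and `StoreGatherQueue` is a hypothetical name; only the 64-entry depth and 128-byte half-line size come from the slide.

```python
from collections import OrderedDict

HALF_LINE = 128   # bytes per entry (from the slide)
ENTRIES = 64      # queue depth (from the slide)

class StoreGatherQueue:
    def __init__(self):
        self.entries = OrderedDict()   # half-line index -> {offset: byte}, oldest first
        self.drained = []              # (index, bytes) pairs written on to L2/L3

    def store(self, addr, data: bytes):
        for i, b in enumerate(data):
            index, off = divmod(addr + i, HALF_LINE)
            if index not in self.entries:
                if len(self.entries) == ENTRIES:
                    # Queue full: drain the oldest entry as one downstream write.
                    self.drained.append(self.entries.popitem(last=False))
                self.entries[index] = {}
            self.entries[index][off] = b   # merge into the existing half-line

q = StoreGatherQueue()
q.store(0, b"ab")      # two stores to the same half-line share one entry
q.store(8, b"cd")
q.store(130, b"e")     # a different half-line gets its own entry
```

Because several stores collapse into one entry, the downstream caches see one write per half-line instead of one per store, which is the traffic reduction the slide describes.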
Targeted Architectural Extensions

= 2 Gigabyte Page support
= Decimal Floating Point Extension
  — Instructions to convert numeric data between 2 formats:
    zoned fixed-point decimal, and
    decimal floating point
= Instruction Processing Directives
  — Branch preload instructions
    Specifies the address of a branch instruction and its target to be installed into
    branch prediction tables (through BTBP)
  — Data access intent instruction
    Specifies what operands of the next instruction may be further accessed for
    e.g. getting a cache line exclusive on a load for future store
    e.g. keeping access-once line at current Least-Recently-Used (LRU) position
Architectural Differentiation Extension: Transactional Execution

= General Purpose Multiprocessor Support
  — Instructions specifying start, end, and abort of a transaction
  — Pending storage updates are “shielded” from other processors until transaction completes
  — Implemented at heart of CPU (core+L1) for performance
  — Heavy focus on support for software usage and debug
  — “Constrained Transaction” with hardware auto-retries for code simplification
= Prototype benchmark with HTM
  — Showed ~2x improvements and better scalability (slope)

[Chart: Concurrent Linked-Queue Benchmark w/ Java Prototype. En-queue throughput vs # of Threads (2 to 40).]

Prototype Data provided by Jerry Zheng, Marcel Mitran @ IBM Toronto
Architectural Differentiation Extension: Runtime Instrumentation

= Low overhead profiling with hardware support
  — Instruction samples by time, count or explicit marking
= Sample reports include hard-to-get information:
  — Event traces, e.g. taken branch trace
  — “costly” events of interest, e.g. cache miss information
  — GR value profiling
= Enables better “self-tuning” opportunities

[Figure: compiler flow and sample collection. Recoverable labels: Intermediate representation generator, Optimizer, Code generator, (analyze), Event Tracing, CB head, Circular Collection Buffer.]
Summary: zNext will.....

= Be used in a new family of IBM System z mainframe servers
= Sustain IBM's mainframe leadership in computing capacity and
  performance without sacrificing any reliability, with
  — Up to 6 active cores per chip
  — 48M-byte shared on-chip L3 cache
  — uniquely designed low-latency private L2 cache
  — >24K target and >32k direction branch histories
  — Numerous micro-architectural enhancements*
= Provide architecture extensions*, and
  be the 1st general purpose microprocessor to support
  — hardware transactional memory
  — software self-directed run-time profiling
= Be amongst the fastest microprocessors @ 5.5 GHz
  — joining z196 @ continuous clock-speed of >5 GHz


* Not all features and extensions described in this presentation 
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ABSTRACT 


Power management is of increasing concern and challenge to SOC 
and product designers [1], [2]. Power Gating (PG) is now well 
understood as a technique for reducing static leakage power when 
circuits are idle [3]. State-Retention Power Gating (SRPG) 
enhancements in hardware [4] can address fast wake-up latency 
and transparency to system software but have area, performance 
and robustness/reliability impacts that need minimizing [5]. 


This presentation addresses practical application of State Retention 
Power Gating to CPU subsystems, (but applicable to other SOC 
sub-systems) and covers what matters from the system and RTL 
designer perspective building on the EDA implementation support 
from UPF [6] and CPF [7] power intent. 


Current EDA support for Power Gating is tuned around “logic- 
level” drive of power gates. The new techniques that are described 
and contrasted build on the multi-voltage aware tools and formats 
to add enhanced power gate performance as well as addressing 
state retention without the traditional area and timing penalties. 


The work described in this paper is at an applied research phase
and has been undertaken in collaboration with researchers in the
Electronics and Computer Science faculty of the University of
Southampton in the UK; the technology demonstrator implemented
in silicon (on a 65nm Low Leakage process) was co-developed and
fabricated using the EUROPRACTICE “mini@sic” Multi-Project
Wafer service with TSMC Inc as the semiconductor foundry [8].


1. BACKGROUND 


The research group at ARM has worked for a number of years with 
customers and leading EDA partners to take complex 'expert' low- 
power industry techniques and facilitate their successful adoption 
for standard System-on-Chip designers and implementers. This 
increasingly requires the development of Physical IP components 
and model abstractions that support the current and evolving Multi- 
Voltage tools and UPF and CPF standards. 


2. BUILDING ON BASIC POWER GATING 


The multi-voltage tools support and associated power intent now 
prove to be a foundation for more advanced techniques to improve 
on the base-line power-gating and state retention support envisaged 
as the EDA tools were developed [9]. 


2.1 Multi-Voltage Power Gating 


Industry standard “Multi-Voltage” EDA tools support logic level 
drive of the gate terminal of the power switches, while more expert 
approaches have traditionally been required to add Gate Bias to 
improve the off-current (ratio) of power gates [10]. 


A Super-Cutoff CMOS “buffered” power gate cell family with
integrated level shifting has been developed to work seamlessly
with standard EDA MV tool flows (shown in Figure 1). Header
power gates are of primary interest to facilitate simple generation
of the gate bias supply voltage (the core voltage rail augmented by a
small charge pump, or regulated from a higher I/O supply rail).


[Figure 1 schematic. Recoverable labels: VDDGB, VDD, VVDD, VSS, MV.]

Figure 1: SCCMOS enhanced power gate


The multi-voltage internals of the enhanced switch are hidden from 
the implementation tools and support lower off-current with High- 
Vth “MTCMOS” power switches, or lower-IR drop with standard 
Vth switches. 


3. ENHANCING BASIC STATE RETENTION 


The experimental approach adopted has been to amortize the cost 
of state retention across multiple registers by splitting the power 
rails for high performance flip-flops (a near-zero area cost) and 
amortize the retention cost by managing the clamping of clocks and 
resets efficiently in the SOC implementation flow such that the 
speed and area impacts are minimized over and above the cost of 
Power Gating that designers well understand. 


Figure 2 illustrates how the retention power domain is distributed 
to manage “live-slave” state retention between clock-gates and 
registers. For registers with asynchronous reset controls such 
controls must also be explicitly clamped similarly. 
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Figure 2: Advanced SRPG distributed retention domain 


For short-term SRPG support the slave latches and associated 
clamping domains must be kept powered. For deep sleep this 
domain (shown with gray overlay) would be power-gated off as 
well (state lost PG, potentially requiring software). 


Voltage scaling of the state retention rail is attractive to provide
an extended SRPG mode of operation, but simple techniques such
as adding a Vt-drop that was safe at higher-voltage process nodes
[4] do not provide sufficient safe state-integrity margin for latch
structures on sub-90nm technologies with higher inter-device
variation on latch feedback structures. Figure 3 shows the addition
of a Boosted-Gate “drowsy” retention to the buffered SCCMOS
power-gate of Figure 1 where the raised-voltage Gate Bias supply
provides additional headroom to the scaled retention voltage.
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Figure 3: SCCMOS with Boosted-Gate retention 


State retention needs to be 100% robust and reliable in the presence
of power-gating transients and noise from neighboring blocks that
share a common ground or supply. The underlying retention
registers need to be designed to balance retention leakage power
with safe retention latch structures. The poster describes the
experimental structures designed and implemented to evaluate and
characterize the integrity of retention registers at reduced voltages.


4. TECHNOLOGY EVALUATION 

Figure 4 depicts the layout of the test silicon implemented to 
validate the physical IP cell abstractions and EDA flow 
compatibility. 


Figure 4: Advanced ARPG test silicon (TSMC65LP) 


Due to small silicon die-size availability for rapid prototyping (2 x
2mm), small ARM® Cortex-M0™ CPU macro-cells were chosen
and constrained for performance to a worst-case corner signoff at
330MHz on the 65nm Low-Leakage technology. Five matched
pairs of CPUs were instantiated, with the 4 pairs on the right of the
layout used to evaluate standard PG and SRPG implementations plus the
enhanced retention ARPG and SCCMOS plus DRPG gate bias
implementations. The “tracking-pair” approach allows the
implementations to be evaluated at 400MHz+, with each CPU of a
pair having critical paths stressed in even and odd clock cycles,
while the main SOC runs reliably with zero wait states at 200MHz.
Finally, the chip includes state integrity structures in the lower-left
layout to analyze state integrity and reliability in the presence of
switching noise and power gating inrush.
Building on basic Power Gating (as inferred by CPF, UPF)
to address near-zero-overhead State Retention Power
Gating, enhancing CPU leakage mitigation modes

= Power Gating, PG, is well supported by Electronic Design Automation tools
  * Based on logic-level drive of Header or Footer “MT-CMOS” switches
= State Retention Power Gating, SRPG, is also richly supported in the power intent standards

This work builds on industry standard EDA flows to allow more expert implementation techniques
to be utilized without resorting to full-custom design
1. To support gate-bias (overdrive) techniques to enhance power gating
  * Super-Cutoff CMOS, SCCMOS, gate drive to power switches
  * Header power gates in this case
2. To provide advanced State Retention optimized for performance and minimal area impact
  * Advanced SRPG compared to the conventional EDA preferred register level abstraction

IP Deployment 
Reference designs 


= Building on UPF and CPF PG inference 

= Well supported in Multi-Voltage tools 
* Methodology and design flow "proving" 
= Representative of real-world designs 


= In terms of multi-voltage challenges 


* But quick to design/validate/fabricate/analyze 


= e.g. only 5M transistors, tiny 3.5mm² area
= But 13 power domains, 3 VDD rails


= Analog pads to observe “virtual” rails


= Experimental structures to analyze integrity 


= Artisan® libraries plus R&D prototype cells 


[Chart: Leakage Power (nW)]


Evaluation and 
characterization platform 


SoC architecture 


Retention integrity 
analysis structures 


* Reconfigurable register 


(synthesized) arrays 


* 2-D parity analysis 
= BankO or Bank 1 


* Level-shift scan chains 


* First fail detect 


* Interrupt on error 


= Controlled noise / retention 


* With noisy other bank 


* To analyze VRET scaling 


* Real-time compare 
= Level shifted detect 
* for voltage sensitivity 
= X-Y of first failing bit 


(raise voltage/wake/check) 




* Based on mature technology - TSMC65LP process "Tokachi-1" reference silicon 


Technology demonstrator 


= Including academic research (University of Southampton, UK) 


= With acknowledgement to EU EuroPractice "Mini-ASIC" research program 


* Built using multiple instances of ARM Cortex®-M0 processor
= 14 CPUs with one as a primary system MCU, 200MHz sign-off


= Including 5 pairs optimized for performance (330MHz worst-case sign-off) 


* Built using standard RTL and power intent 
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Can count cycles to error 
Super Cutoff power gating /
Boosted-Gate drowsy retention

“Standard Cell” abstractions of
Super-Cut-off Power Gating
= Enhanced turn-off
= Smaller Standard-Vt switches

Boosted Gate drowsy retention
= Very small Standard-Vt switches
= Safer than full “diode-drop” Vret

Clean EDA deployment

(flop area up to 1.3x+)
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= Experimental macros clocked 


at double speed — 


400MHz 


* Fast-slow every other cycle to 
stay in-step with reference 


CPU 


* Pairs required to ensure every 
instruction is run at-speed 


= "Live-Slave" flops 
= Dual-rail 
= Guarded clock 


= Asynchronous Reset 
= As above 


= Guarded reset 


= Clock Gate cells 


= Protect the clock 


Drowsy voltage vs Gate Bias 


Addressing overheads of 
conventional SRPG 


Retention cost at every flop... 


Low-impact SRPG 


+ Alternative to the EDA preference for Retention 
abstracted per-flop 
+ With the associated area and performance cost 


* “Clamp” (low) the clock from the final clock-gate 
+ And insert a dummy clock gate where there is no
gating
+ Share the cost of clamp across local cluster of flops
+ And "live with this" more expert "flow" in order to reap
the benefits


* Reset also requires clamping 
+ For asynchronous-reset flops 
* (without breaking test tools...) 


HPSRPG Physical-IP 
abstraction 
This paper describes the prototype implementation of the DySER specialization architecture integrated into the OpenSPARC
processor. The paper’s description covers the hardware, compiler, and application tuning. The prototype system provides
speedups up to 14X over OpenSPARC (geometric mean 5x). The architecture is more flexible than SIMD and GPU-based
acceleration while supporting a more diverse set of workloads.


Overview Future processors must improve microarchitectural 
efficiency in order to overcome slowing transistor energy effi- 
ciency and sustain performance growth. The DySER architec- 
ture uses dynamic specialization to provide energy efficient 
performance improvements by complementing conventional 
processors. By using a co-designed hardware-compiler ap- 
proach that avoids disruptive hardware or software changes, 
the architecture Dynamically Specializes Execution Resources 
to match application phases and achieves both functionality 
specialization (like Garp, Chimaera, Conservation-Cores) and 
parallelism specialization (like GPUs and SIMD short-vector 
extensions). We describe here the DySER architecture and its 
execution model, design and implementation of its compiler, 
prototype implementation, and conclude with performance re- 
sults and significance of this work. 


Architecture DySER is an array of configurable functional
units connected with a circuit switched network of simple switches
as shown in Figure 1. A functional unit can be configured
to get its inputs from any of its neighboring switches. When 
all its inputs arrive, it performs the operation and delivers the 
output to a neighboring switch. Switches can be configured 
to route their inputs to any of their outputs, forming a circuit 
switched network. With this configurable network of func- 
tional units, a specialized hardware datapath can be created for 
a sequence of computation. To enable pipelining and dataflow 
like execution, both switches and functional units implement a 
simple credit based flow control that ensures data is forwarded 
only when the credit is available. Credits are generated when 
a functional unit/switch can accept new data. The switches 
in the edge of the array are connected to FIFOs, which are 
exposed to the processor core as DySER’s input/output ports. 
DySER is tightly integrated with a general purpose processor 
pipeline, and acts as a long latency functional unit that has a 
direct datapath from the register file and from memory. The 
processor can send/receive data or load/store data to DySER 
directly through ISA extensions. 
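
The credit-based flow control described above can be illustrated with a small software model. This is a minimal sketch, not the prototype's RTL: the `Stage` class, the one-slot buffers, and the unbounded output FIFO are all assumptions made for illustration. A value moves forward only when the downstream stage has a free slot (a credit).

```python
from collections import deque


class Stage:
    """One switch or functional unit with a small input buffer (hypothetical model)."""

    def __init__(self, name, capacity=1):
        self.name = name
        self.capacity = capacity
        self.buffer = deque()

    def credits(self):
        # One credit exists for every free buffer slot.
        return self.capacity - len(self.buffer)


def step(stages, source, sink):
    """Advance the pipeline by one cycle, forwarding data only where credit exists."""
    # Drain the last stage into the sink (output FIFO assumed unbounded here).
    if stages[-1].buffer:
        sink.append(stages[-1].buffer.popleft())
    # Walk backwards so a slot freed this cycle can be refilled this cycle.
    for i in range(len(stages) - 1, 0, -1):
        upstream, downstream = stages[i - 1], stages[i]
        if upstream.buffer and downstream.credits() > 0:
            downstream.buffer.append(upstream.buffer.popleft())
    # Inject a new value from the input FIFO if the first stage has credit.
    if source and stages[0].credits() > 0:
        stages[0].buffer.append(source.popleft())


def run(values, n_stages=3):
    """Push all values through n_stages stages; return outputs and cycle count."""
    stages = [Stage(f"s{i}") for i in range(n_stages)]
    source, sink = deque(values), []
    cycles = 0
    while source or any(s.buffer for s in stages):
        step(stages, source, sink)
        cycles += 1
    return sink, cycles
```

Calling `run([1, 2, 3])` returns the values in their original order, which is the property the credit mechanism is meant to preserve while the stages operate in a pipelined fashion.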


Execution Model Figure 2 shows DySER’s execution model. 
Before a program uses DySER, it configures DySER by pro- 
viding the configuration for functional units and switches. Then 
it sends data to DySER either from registers or from memory. 
Once data operands arrive at DySER’s input FIFOs, they fol- 
low the configured path through the switches. When the data 
operands reach the functional units, the functional units per-
form the operation in dataflow fashion. Finally, the results of
the computation are delivered to the output FIFOs, from where
the processor fetches the outputs and sends them to the regis-
ter file or to memory using ISA extensions. Further details are
in [2, 3].
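
The execution model above (configure once, send operands, fire each functional unit when its inputs arrive, collect outputs) can be sketched in software. The configuration format, port names, and the two-unit multiply-add datapath below are hypothetical and chosen only for illustration; they are not DySER's actual configuration encoding.

```python
import operator

# Hypothetical configuration: each functional unit names its operation and
# where its two inputs come from (an input port or another unit's output).
CONFIG = {
    "mul": (operator.mul, "in0", "in1"),
    "add": (operator.add, "mul", "in2"),
}
# Hypothetical mapping from output ports to the unit that drives them.
OUTPUT_PORTS = {"out0": "add"}


def execute(config, output_ports, inputs):
    """Fire each unit once all of its operands are available (dataflow order)."""
    values = dict(inputs)   # operands that have arrived so far
    pending = dict(config)  # units that have not fired yet
    while pending:
        ready = [name for name, (_, a, b) in pending.items()
                 if a in values and b in values]
        if not ready:
            raise ValueError("some functional unit never receives its inputs")
        for name in ready:
            op, a, b = pending.pop(name)
            values[name] = op(values[a], values[b])
    return {port: values[src] for port, src in output_ports.items()}
```

With inputs `{"in0": 2, "in1": 3, "in2": 4}` the multiply fires first and the add fires once the product is available, so the sketch computes `2 * 3 + 4` on the single output port.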


Compiler Design and Implementation DySER's compilation
consists of four main phases. The key mechanism we
leverage is a new program representation
called the Access-Execute Program Dependence Graph (AEPDG),
which exposes the spatial and temporal aspects of dependences
to the compiler. The four phases are: i) Selecting regions from
the full Program Dependence Graph (PDG) that are
candidates for mapping to the DySER hardware. ii) Forming
the basic AEPDG encapsulating those code regions. iii)
Transforming and optimizing the AEPDG to meet the goodness
characteristics for the DySER architecture. iv) Generating
code from the AEPDG. Our compiler implements a set of judiciously
chosen and intuitive heuristics to produce good quality
code as part of the transformation and optimization phase.
These are:


* Loop Unrolling/PDG Cloning
* Strip Mining/Vector Deepening
* Subgraph Matching
* Execute-PDG Splitting
* Scheduling Execute-PDG
* Loop Unrolling/Dependence Analysis
* Traditional Loop Vectorization
* Load/Store Coalescing


To implement our compiler, we leverage the LLVM com-
piler framework and its intermediate representation (IR). We
have developed LLVM optimization passes that process the 
LLVM-IR to construct the AEPDG and apply the associated 
transformations. Finally, we extend the LLVM code-generator 
to assemble DySER instructions and configurations. 


Prototype Implementation We have completed a full RTL 
implementation of the DySER architecture integrated into the 
OpenSPARC pipeline. In terms of physical design, we have 
synthesis based results. The DySER block occupies an area of
1.54 mm² using a 55nm ASIC library, and on average con-
sumes 72 mW.
In terms of implementation complexity, our prototype shows
the DySER design is practical. The final interface consisted of
only 11 signals in the RTL between OpenSPARC and DySER,
and a total of less than 750 lines of code modified in OpenSPARC.
Further details are in [1].


Figure 2: DySER Execution Model


We have also completed a mapping to FPGA of the full
design using the Virtex-5 board. This FPGA implementation
boots unmodified Ubuntu 7.10 Linux and runs C/C++ pro-
grams compiled through our toolchain. For detailed perfor-
mance evaluation on our FPGA prototype, we implemented
several FPGA optimizations to the architecture. These include
simplifying the switches, "hardening" the configuration infor-
mation and creating an FPGA bit-file specific to each applica-
tion, and simplifications to the load-store interface.


Performance To measure DySER's efficiency in specializa-
tion, we have evaluated its performance on a suite of SIMD
and GPU workloads to capture its functionality and parallelism
specialization capability. We compare performance of DySER-
accelerated implementations of these benchmarks to the se-
quential OpenSPARC implementation, and hand-optimized SIMD
and GPU implementations. Based on measurements on our
FPGA prototype implementation, compared to the OpenSPARC
baseline, the DySER prototype provides a speedup of up to
7x, with a geometric mean speedup of 3x on this diverse
benchmark suite. Adding a vectorized mode to DySER pro-
vides up to 14x speedup with a geometric mean speedup of
5x. We observe that OpenSPARC's single-issue pipeline is
the main bottleneck throttling the rate at which DySER is fed.
When integrated with a dual-issue out-of-order processor, re-
sults from our cycle-accurate performance simulator show DySER
continues to provide similar speedups: up to 14x, with a geo-
metric mean speedup of 3.5x. As elaborated in [2], compared
to SSE, DySER provides geometric mean 2.5x speedup, and
compared to GPU execution, it provides 1.2x speedup.
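
The results above are summarized with a geometric mean, which is the usual way to average speedup ratios. The sketch below shows the calculation. The per-benchmark speedups in `example` are hypothetical placeholders, since the abstract does not list individual benchmark results.

```python
import math


def geometric_mean(speedups):
    """Geometric mean of per-benchmark speedups (all must be positive)."""
    if not speedups or any(s <= 0 for s in speedups):
        raise ValueError("speedups must be a non-empty list of positive numbers")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-benchmark speedups, not measurements from the prototype.
example = [1.0, 2.0, 4.0, 8.0]
summary = geometric_mean(example)
```

Unlike an arithmetic mean, the geometric mean is not dominated by a single large ratio, which is why a maximum speedup of 14x can coexist with a mean of 5x.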


Implications and Significance DySER is the culmination and
generalization of trends already occurring for popular paral-
lelism based accelerators. SSE has been augmented with both
functionality-specialized and non-purely word parallel instruc-
tions. Instructions in NVIDIA Kepler GPUs are specialized
for the particular region with compiler annotations indicating
when to issue.

Not only is DySER a more natural evolution of special- 
ization strategies, but it is also more practical to implement. 
From a software perspective, it is a more flexible compiler tar- 
get than SSE, and DySER does not require a new software 
stack and application implementations as for the GPU. From a 
hardware perspective, its interface enables simple integration 
with a processor pipeline. 

The most profound implication of DySER is that the exe- 
cution model and architecture provide a practical way to im- 
plement instruction-set specialization, SIMD specialization, and 
domain-driven accelerators using one substrate. With its im- 
pressive speedup and corresponding energy gains, DySER sig- 
nificantly improves architectural energy efficiency using spe- 
cialization. The novel architecture, its prototype implementa- 
tion, and energy efficiency implications of the execution model 
provide a set of promising mechanisms. 
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DySER Approach 
° Compiler assisted dynamically specialized 
computation through heterogeneous array 
of functional units 
° DySER configured once for multiple invocations 
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OpenSPARC T1 Integration 
° Limited ISA extensions required:
  d_init, d_send, d_recv, d_load, d_store
* Only eleven interface signals in RTL/microarch
* Few lines of changed Verilog code:

  Module   Lines Changed   Notes
  IFU      275             Reserved opcodes used for DySER instructions
  LSU      23              Reverse engineered memory control
  EXU      216             DySER model Verilog and 18 FF added
  MMU      0               Unchanged!
  Total    514             Minimal changes!

DySER Architecture 
° Alongside Execute stage in processor pipeline
° Concurrently executes DySER and non-DySER 
code 
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FPGA Prototype 
° Utilizes Xilinx Virtex-5 FPGA Board
° Fits a "hard" 4x4 DySER with fixed paths 
° Boots unmodified OpenSPARC Ubuntu 7.10 
° DySER is not on the critical path! 
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ASIC Synthesis @ 55nm 
Area: 1.54 mm²   Power: 72 mW


Prototyping the DySER Specialization Architecture with OpenSPARC 


Jesse Benson, Ryan Cofell, Chris Frericks, Venkatraman Govindaraju, 
Chen-Han Ho, Zachary Marzec, Tony Nowatzki, and Karu Sankaralingam 
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DySER Compiler 
° LLVM based compiler 
° Generates specialized binaries for DySER
  from C/C++ source code


Scalar Code:

    for (i = 0; i < n; i++) {
      if (a[i] > 0)
        c[i] = 1 / b[2 * i];
      else
        c[i] = b[2 * i] * 2;
    }

DySER Code:

    #pragma dyserize
    for (i = 0; i < n; i++) {
      if (a[i] > 0)
        c[i] = 1 / b[2 * i];
      else
        c[i] = b[2 * i] * 2;
    }

Scheduled DySER: [figure]
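
To show what an access/execute split of the loop on this slide might look like, here is a minimal sketch. The function names and the way the split is drawn are assumptions for illustration; the actual compiler works on the AEPDG in LLVM and emits DySER instructions rather than function calls.

```python
def execute_region(a_i, b_2i):
    """Execute component: pure computation, the part that would map onto DySER."""
    return 1 / b_2i if a_i > 0 else b_2i * 2


def access_region(a, b, n):
    """Access component: loads, stores, and loop control stay on the processor."""
    c = [0.0] * n
    for i in range(n):
        a_i, b_2i = a[i], b[2 * i]        # loads (stand-in for sends to DySER)
        c[i] = execute_region(a_i, b_2i)  # result received and stored
    return c


def scalar_reference(a, b, n):
    """Direct transcription of the scalar loop, used to check the split version."""
    c = [0.0] * n
    for i in range(n):
        if a[i] > 0:
            c[i] = 1 / b[2 * i]
        else:
            c[i] = b[2 * i] * 2
    return c
```

Both versions produce the same output array. The point of the split is that `execute_region` contains no memory accesses, so it can be configured once and invoked for every iteration.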
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DySER Performance 
° Throughput/high-performance workloads 
° Competitive or surpasses SIMD/GPU approach 


Non-vectorized (1 wide): 2.5x speedup
Vectorized (8 wide): 4.9x speedup
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Low Power and High Performance 3-D 


Multimedia Platform 

Po-Han Huang, Chi-Hung Lin, Hsien-Ching Hsieh, 
Huang-Lun Lin and Shing-Wu Tung

Information and Communications Research Lab. 
Industrial Technology Research Institute 
Hsinchu, Taiwan 

E-mail: pohan@itri.org.tw


Traditional technology scaling of semiconductor chips
has followed Moore's Law. However, transistor
performance improvement will be limited, and
designers will no longer see operating frequency
double every two years. Recently, 3-D integrated
circuits that use through-silicon vias (TSVs) for
interconnection have been developed as an improved
alternative to Package-on-Package (PoP) and
System-in-Package (SiP) packaging. TSV-based 3-D
integration offers several benefits: (1) circuit delay
improves because of the shorter interconnect and reduced
parasitic capacitance/inductance; (2) more functionality can be
integrated into a small silicon area, reducing form factor
and raising packing density thanks to the
additional third dimension; (3) components
with incompatible manufacturing processes (e.g., logic,
DRAM, Flash) can be combined in a single 3-D IC
for heterogeneous integration. 3-D integration
based on TSV technology enables stacking of multiple
memory layers to obtain higher bandwidth for
recent multimedia applications at lower energy
consumption. Intel has demonstrated this with its
teraflops microprocessor chip, an 80-core
design with a memory-on-logic architecture in which each
core connects to a 256KB SRAM with 12GB/s
bandwidth. Although 3-D ICs overcome many
limitations and drawbacks of 2-D IC design, they still pose
many challenges and design issues that must be
considered carefully. In general, the number of TSVs is
the most critical constraint when designing a 3-D
architecture because it is closely tied to system
performance.


This poster presents a 3-D multimedia platform,
3D-PAC, designed by ITRI. 3D-PAC is built by
stacking an SRAM tier on the original 2D-PAC. With
this 3-D stacking technology, performance
improves by up to nearly 54%, depending on the
application. The poster shows the method of
architecture exploration for 3-D stacking and
describes the detailed implementation of the reconfigurable
SRAM and of tier selection when multi-layer SRAM
stacking is needed. The chip is fabricated in
TSMC 90nmG CMOS technology. 2D-PAC has the
following novel features. It is a
heterogeneous multi-core architecture composed of
an ARM926EJ-S and two PACDSPs (variable-length,
5-way VLIW architecture designed by ITRI/ICL). The
system also includes three kinds of buses:
AXI, AHB and APB. Many peripherals are also
implemented in the system, such as I2C and UART.
In general, the evaluation of a new architecture for
optimized stacked static memory is driven by area,
performance, energy efficiency, number of TSVs and
thermal issues. After architecture exploration using
electronic system level (ESL) design, the stacking
memory is integrated with the instruction memory
unit (IMU) and data memory unit (DMU) of the PACDSP
core, a choice driven by performance and TSV count,
to build the so-called 3D-PAC platform.


[Figure: block diagram of 3D-PAC. Labels: 3-D Stacking Memory; PACDSP 1 and PACDSP 2, each with IMU, DMU, DMA and BIU; AXI Interconnect; X2H Bridge; SRAM; DDR2; AXI Sub-system.]


In this architecture, each PACDSP owns a private,
reconfigurable SRAM block (256KB). Programmers
can configure the stacked memory differently
for different applications. For
example, a user can configure part of the memory as
instruction memory and the rest as data memory, or
use the entire memory as data memory, which
adds considerable flexibility to the original architecture.
The 3-D stacked SRAM extends both the
instruction cache and the data memory of the PACDSP.
Programmers therefore benefit in two ways: (1)
a larger instruction cache reduces the latency
caused by cache misses, and (2) a larger local data
memory reduces the frequency of external memory
accesses.
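
The reconfiguration choice described above amounts to splitting a fixed 256KB block between instruction and data use. The sketch below models that constraint. The `SramPartition` type and the specific 64/192 split are hypothetical; the poster does not describe the real configuration interface.

```python
from dataclasses import dataclass

BLOCK_KB = 256  # private stacked SRAM per PACDSP, as stated in the text


@dataclass(frozen=True)
class SramPartition:
    """Hypothetical description of how one 256KB block is divided."""
    instruction_kb: int
    data_kb: int

    def __post_init__(self):
        if self.instruction_kb < 0 or self.data_kb < 0:
            raise ValueError("partition sizes must be non-negative")
        if self.instruction_kb + self.data_kb != BLOCK_KB:
            raise ValueError(f"partitions must sum to {BLOCK_KB} KB")


# The whole block used as data memory (one of the options named in the text).
all_data = SramPartition(instruction_kb=0, data_kb=256)
# A mixed split; the exact sizes here are illustrative only.
mixed = SramPartition(instruction_kb=64, data_kb=192)
```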


Two real applications, a multi-channel H.264 decoder
and a JPEG decoder, are chosen to analyze the impact
and efficiency of extending PACDSP local memory
by 3-D stacking. Experiments are executed on the ESL
platform mentioned above. In the multi-channel H.264
decoder application, two PACDSPs decode
four different films simultaneously and display them on an LCD
screen at the same time. This requires more data
movement and computation power than
single-film decoding.

Experiment configurations (H.264 decoder):

> ARM9 = 204MHz, AHB = 102MHz

> PACDSP = 204MHz, AXI = 204MHz

> DDR2 data rate = 408MHz

> Bitstream: SHISEIDO_track1, Jomo_track2, Ice Age, STC TEST Motion (10 frames, QVGA)

Experimental results show that overall performance
improves from 11.56fps to 14.80fps (28.02%)
without applying any parallelism or optimization to the
H.264 decoder application. Each DSP is in charge of
decoding two films. Without 3-D stacking
memory, the PACDSP must back up its internal data
to external DDR2 memory on every film switch
because it lacks internal data memory, which incurs
a huge overhead. By contrast, with enough 3-D
stacking memory (256KB each), the PACDSP
does not have to back up this data during film
switching, which greatly improves
system performance. After applying some parallelism
and optimization to the multi-channel H.264 decoder,
system performance reaches 26.09fps (a 54.19%
improvement over the 2D result of 16.92fps).
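
The improvement percentages quoted above follow from the frame rates as a relative gain over the 2D baseline. A minimal sketch of that arithmetic, using only the frame rates stated in the text (the function name is our own):

```python
def improvement(baseline_fps, new_fps):
    """Relative performance gain of new_fps over baseline_fps, as a fraction."""
    return new_fps / baseline_fps - 1.0


# Frame rates reported in the text.
non_parallel = improvement(11.56, 14.80)  # roughly 0.28
enhanced = improvement(16.92, 26.09)      # roughly 0.54
```

Both values agree with the reported 28.02% and 54.19% to within rounding of the published frame rates.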


H.264 Version                  Total Cycle    Improve.

Non-Parallel Version
  2D                           194,019,671
  3D with 4 binary             151,524,469    28.02%
Parallel Version
  2D                           164,531,751
  3D with 2 binary             121,891,167    34.89%
Parallel Version (Enhanced)
  2D                           132,557,335
  3D with 2 binary              86,001,705


Experiment configurations (JPEG decoder):

> ARM9 = 204MHz, AHB = 102MHz

> PACDSP = 204MHz, AXI = 204MHz

> DDR2 data rate = 408MHz

> Bitstream: test_image (1 frame, QCIF)

Experimental results show only a small
system improvement from extending the instruction
cache for the JPEG decoder because the ratio of
cache misses is low (0.02%). The original 64KB
cache is enough for this application. By contrast,
enlarging the local data memory yields a huge
performance improvement for the JPEG decoder because
the ratio of external accesses is high (85.87%). This means
the original 64KB data memory is not enough for this
application. This analysis suggests that the
suitable memory configuration depends on the
application. The 3D-PAC platform maintains this design
feature. Each PACDSP owns a private SRAM block
(256KB) which is reconfigurable, so
programmers can configure the stacking
memory differently for different applications.

Execution Type           Cycle Count    Ratio
Cache Size: 64KB
  Cache miss                 129,314     0.02%
  External access        550,498,586    85.84%
  DSP computation         90,676,349    14.13%
  Total cycle count      641,304,249
Cache Size: 128KB
  Cache miss                   9,376     0.00%
  External access        550,494,242    85.87%
  DSP computation         90,525,505    14.12%
  Total cycle count      641,029,123

Execution Type           Cycle Count    Ratio
Data Memory Size: 64KB
  Cache miss                   9,376     0.00%
  External access        550,494,242    85.87%
  DSP computation         90,525,505    14.12%
  Total cycle count      641,029,123
Data Memory Size: 64+192KB
  Cache miss                   8,865     0.00%
  External access          2,051,198     1.96%
  DSP computation        102,449,710    98.02%
  Total cycle count      104,509,773


There are 1,914 TSVs in total in 2D-PAC, placed in the
middle area. The chip specifications (for both 2D-PAC and the 3-D
stacking SRAM) and die photo are as follows:

Design                 2D-PAC
Process                TSMC 90nmG
Operating Frequency    PACDSP 300MHz
Operating Voltage      Core: 1.0~1.2V, I/O: 3.3V
# I/O Pads             498 with PWR/GND
Die Area               7880 x 7880 um²

Design                 3-D stacking SRAM
Process                TSMC 90nmG
Operating Frequency    300MHz
Operating Voltage      Core: 1.0~1.2V
Die Area               3880 x 3880 um²


3D-PAC (2D-PAC stacking with 3-D SRAM) die photo:
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Final architecture of 3D-PAC 


Low-Power and High-Performance 3-D Stacking Multimedia Platform 
Industrial Technology Research Institute (ITRI), Taiwan 


platform is proposed Extended SRAM accessed through TSV 
(Multi-layer stacking supported) 


SMM  SD-PAC for system bandwidth 
and performance improvement 
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neglectable 
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have been utilized for 3D stacking 


2D-PAC with various low-power techniques 


VDDEN_DSM (VS) 
VDDE N_DSPI (VS) 


Case1: Stacking on AXI Bus
Case2: Stacking on DSP Bus Interface
Case3: Stacking on DSP Instruction/Data Memory Unit (IMU/DMU)


Configurations:

O ARM9 = 204MHz, AHB = 102MHz

O PACDSP = 204MHz, AXI = 204MHz

O DDR2 data rate = 408MHz

O Application: H.264 decoder

O Bitstream: foreman (30 frames, QCIF)


Performance/Cost Evaluation with ESL 


Both DSPs can be
turned off or support
voltage scaling (DVS)


Architecture                      Total Cycle             Improve.
2D-PAC (DDR2)                     164,531,751    13.64
3D-PAC (3-D SRAM on AXI bus)      162,312,919    13.83    1.39%      272
3D-PAC (3-D SRAM on BIU)          162,308,407    13.83    1.39%      544
3D-PAC (3-D SRAM on IMU/DMU)      121,891,167    18.40    34.89%     1,886
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Heterogeneous multi-core architecture: 


O ARM926EJ-S

O Two PACDSPs (variable-length 5-way VLIW)
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Low power architecture design flow: 
Common Power Format (CPF) by Cadence Inc. 


Suitable memory configuration depends on different application => Reconfigurable 3-D stacking memory 


Configurations:

O ARM9 = 204MHz, AHB = 102MHz

O PACDSP = 204MHz, AXI = 204MHz

O DDR2 data rate = 408MHz

O Application: JPEG decoder

O Bitstream: test_image (1 frame, QCIF)


Case1: Cache size

Execution Type           Cycle Count    Ratio
Cache Size: 64KB
  Cache miss                 129,314     0.02%
  External access        550,498,586    85.84%
  DSP computation         90,676,349    14.13%
  Total cycle count      641,304,249
Cache Size: 128KB
  Cache miss                   9,376     0.00%
  External access        550,494,242    85.87%
  DSP computation         90,525,505    14.12%
  Total cycle count      641,029,123


Case2: Data memory size

Execution Type           Cycle Count    Ratio
Data Memory Size: 64KB
  Cache miss                   9,376     0.00%
  External access        550,494,242    85.87%
  DSP computation         90,525,505    14.12%
  Total cycle count      641,029,123
Data Memory Size: 64+192KB
  Cache miss                   8,865     0.00%
  External access          2,051,198     1.96%
  DSP computation        102,449,710    98.02%
  Total cycle count      104,509,773
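
The ratios in the Case2 table can be reproduced from the cycle counts, and doing so also gives the overall reduction in cycles from enlarging the data memory. A minimal sketch using the counts from the table (the dictionary keys and function name are our own):

```python
def breakdown(cycles_by_type):
    """Return the total cycle count and each category's share of it."""
    total = sum(cycles_by_type.values())
    return total, {kind: count / total for kind, count in cycles_by_type.items()}


# Cycle counts from the Case2 table.
data_64kb = {
    "cache_miss": 9_376,
    "external_access": 550_494_242,
    "dsp_computation": 90_525_505,
}
data_256kb = {
    "cache_miss": 8_865,
    "external_access": 2_051_198,
    "dsp_computation": 102_449_710,
}

total_64, share_64 = breakdown(data_64kb)
total_256, share_256 = breakdown(data_256kb)
cycle_reduction = total_64 / total_256  # how many times fewer cycles
```

The totals match the table rows, and the ratio of the two totals shows that the JPEG decoder needs roughly six times fewer cycles once external accesses are mostly eliminated.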


Design SPEC: 

O Private SRAM block (256KB) for each DSP

O Reconfigurable SRAM as internal instruction
or data memory of DSP


Implementation Results 


Logic Layer
Design                 2D-PAC
Process                TSMC 90nmG
Operating Frequency    PACDSP 300MHz
Operating Voltage      Core: 1.0~1.2V, I/O: 3.3V
# I/O Pads             498 with PWR/GND
Die Area               7880 x 7880 um²

Memory Layer
Design                 3-D stacking SRAM
Process                TSMC 90nmG
Operating Frequency    300MHz
Operating Voltage      Core: 1.0~1.2V
Die Area               3880 x 3880 um²


Heterogeneous integration with total 1,914 TSVs 
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1. Introduction 


Although battery life has always constrained embedded and mo- 
bile hardware developers, the rise of smart phones and tablets has 
also made energy a fundamental concern of software developers. 
On the desktop, software developers generally ignored energy, 
but in the mobile environment, battery life is critical to the user 
experience. Just as developers use performance profiling tools, 
they now need energy profiling tools to understand how and why 
their software consumes energy. 

The inconvenience, cost, and complexity of external power 
measurement hardware and the inaccuracy of on-board power 
sensors on older phones [1] motivated researchers to create power 
modelling tools. Power modelling uses utilization metrics to esti- 
mate power draw based on previously measured correlations be- 
tween the metrics and power. 

We show that the on-board power sensor is now accurate on a 
Windows Phone 7.5 device running on a SnapDragon MSM8660. 
Compared to external measurement hardware, the on-board “fuel 
gauge” is accurate to within 2% of total energy consumption. 
We thus modify the Windows Phone 7.5 OS to sample power 
without external hardware, sample the application call stack to 
correlate energy consumption with code, and examine power 
traces from two weeks of normal use. These traces illustrate 
behavior where modelling alone is not sufficient to understand 
the energy consumption of a mobile device. For example, we 
observe inter-day variations in base power draw as the battery 
discharges, an effect that to our knowledge is not captured by 
existing modelling work. 

This work recommends that a hybrid approach will improve 
the accuracy of energy profiles, and that direct measurements 
will significantly improve the accessibility of fine-grained energy 
information in both testing and deployment. Armed with easy- 
to-use energy analysis tools, hardware designers, OS developers, 
and third party application developers will be better equipped to 
understand and optimize the energy behavior of mobile code. 


2. Related Work 


Early energy modelling research used power measurements of 
executing each machine instruction [4, 5]. These methods do not 
extend well to modelling the power draw of other non-CPU com- 
ponents, which constitute two thirds or more of energy consump- 
tion on mobile devices. 

For mobile devices, recent models use linear regression 
trained on energy profiles of the entire device gathered from sce- 
narios that stress each device component [3]. This model is ac- 
curate compared to external measurements on short-duration test 
runs. This approach, however, cannot address the tail power state 
problem, where components improve responsiveness by waiting 
in a higher power state for more work to arrive (see Section 4.1). 
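
The regression-based models described above can be illustrated with a one-variable least-squares fit. This is a minimal sketch under stated assumptions: the utilization and power samples are hypothetical, and real models such as [3] regress on several component metrics rather than CPU utilization alone.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: CPU utilization (0..1) and measured power (W).
cpu_utilization = [0.0, 0.5, 1.0]
power_watts = [0.4, 0.9, 1.4]

slope, base_power = fit_line(cpu_utilization, power_watts)


def predict_power(utilization):
    """Estimate power draw from utilization using the fitted model."""
    return slope * utilization + base_power
```

The fitted intercept plays the role of the base power draw. A model trained this way assumes that intercept is constant, which is the assumption the day-to-day measurements later in the paper call into question.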

To provide fine-grained energy accounting, Pathak et al. [2] 
introduce a finite state machine that models component power 
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draw by tracing system calls. For example, a read system call 
transitions the flash storage component into a higher power state. 
This approach is promising for attributing energy consumption 
to code. However, it does not model power draw of key device 
components, such as the screen or application CPU draw, and so 
cannot present a complete picture of the device’s energy usage. 

Dong and Zhong [1] overcome these problems with a hybrid 
approach. They create models on-the-fly using the on-board bat- 
tery sensor. They use low-frequency samples from the sensor to 
track accuracy and trigger model reconstruction if the variation is 
too high. This method potentially accounts for variation in base 
power draw, but with a significant cost or latency. However, it 
does not address the tail power state problem. 


3. Measuring energy consumption 


External measurement hardware, such as the Power Monitor with 
which we compare in this paper, accurately measures power 
draw from a phone’s battery. These tools, however, are relatively 
expensive ($750, which is more than the phone itself) and limit 
phone mobility, restricting real world testing. 

On modern mobile devices, the “fuel gauge” (FG) provides 
accurate readings of battery voltage and instantaneous current 
draw for displaying remaining battery life. To validate the FG’s 
accuracy, we executed benchmarks on an HTC (MSM8660 processor)
device running Windows Phone 7.5, simultaneously capturing
power measurements from the FG and an external Power
Monitor. The FG was accurate to within 2%+0.02 of the Power 
Monitor. We suggest this accuracy is enough to replace external 
hardware as the source of power measurements. 
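
A comparison of this kind reduces to integrating two power traces into energy and taking the relative difference. The sketch below shows one way to do that with the trapezoidal rule. The sample traces are hypothetical and the paper does not specify its integration method, so this is an illustration rather than the authors' procedure.

```python
def energy_joules(samples):
    """Integrate (time_s, power_w) samples, sorted by time, with the trapezoidal rule."""
    return sum((t1 - t0) * (p0 + p1) / 2
               for (t0, p0), (t1, p1) in zip(samples, samples[1:]))


def relative_error(measured, reference):
    """Absolute difference between two energies as a fraction of the reference."""
    return abs(measured - reference) / reference


# Hypothetical traces from an external monitor and the on-board fuel gauge.
monitor_trace = [(0.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
fuel_gauge_trace = [(0.0, 1.0), (1.0, 1.96), (2.0, 1.0)]

error = relative_error(energy_joules(fuel_gauge_trace),
                       energy_joules(monitor_trace))
```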


4. Power modelling 


The main challenge in on-board energy profiling is accurately 
attributing the energy consumed by particular applications and 
methods. The traditional approach to this problem has been mod- 
elling, which can both produce power readings and attribute them 
to code entities. 


4.1 Tail power states


Even with a model, tail power states complicate energy profil- 
ing. To provide responsiveness, many components (e.g., radio, 
GPS) continue to draw high power after use. For example, a 3G 
radio may remain in a higher “tail” power state for up to seven 
seconds after use. This tail power state complicates energy at- 
tribution: the download has completed, the code has moved on, 
and the application may no longer be running, but power is still 
drawn. Further, if several applications use a component, which 
one should be charged for the tail power state? 

Pathak et al. [2] model tail power states with their system call 
model, record the calling context of each system call, and assign 
the tail power to the last calling context that used the device. 
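
The attribution rule just described (charge tail power to the last calling context that used the device) can be sketched as follows. The trace format, the context names, and the two-second tail window are assumptions for illustration; this is not the implementation from [2].

```python
def attribute_energy(samples, uses, tail_seconds, period_s=1.0):
    """Charge each power sample to the most recent user within the tail window.

    samples: list of (time_s, power_w) for one component, one per period.
    uses: list of (time_s, context) marking when a context used the component.
    """
    charged = {}
    ordered_uses = sorted(uses)
    for t, power_w in samples:
        owner = "unattributed"
        for use_time, context in ordered_uses:
            if use_time <= t <= use_time + tail_seconds:
                owner = context  # later uses overwrite earlier ones
        charged[owner] = charged.get(owner, 0.0) + power_w * period_s
    return charged


# Hypothetical radio trace: two applications use the radio one second apart.
radio_samples = [(0, 1.0), (1, 1.0), (2, 0.8), (3, 0.8), (4, 0.1)]
radio_uses = [(0, "app_a.download"), (1, "app_b.sync")]

energy_by_context = attribute_energy(radio_samples, radio_uses, tail_seconds=2)
```

In this trace the second application is charged for the whole tail even though the first one woke the radio, which illustrates the fairness question raised above.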
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(b) Day 2: Idle power 0.23 W 


Figure 1. Two days of power usage with day-to-day variation. 
Points are coloured by active application. 


4.2 Shortcomings of modelling 


There are two main drawbacks to modelling: 


1. Accuracy is limited by the training environment and components
modelled (e.g., CPU and screen are often missing). Because
the mobile environment is so diverse (hardware models,
cellular networks, etc.), modelling has the potential to be very
inaccurate.


2. The trade-off between latency, cost, and portability essen- 
tially limits models to testing. Shipping system call traces to 
a server introduces long latency, but running models on-board 
has high overhead and can require external hardware, imped- 
ing portability. 


Our experiments also show a significant day-to-day variation 
in base power draw when the phone is “idle.” Figure 1 plots two 
days of power readings, revealing a variation of 0.19 W in base 
power draw between two consecutive days of usage (the dashed 
line at base of each figure). A point (x,y) on this graph plots 
the power in watts (y) as a function of a day’s usage (time on
the x axis). Day 1’s base draw is 0.42 W while Day 2's is 0.23 W, 
which is a significant difference in total energy consumption over 
several hours. Unless this variation is observed in the lab, a model 
is unlikely to predict this day-to-day variation. 
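
To make the size of this effect concrete, the difference in base power can be converted to energy over a period of use. A minimal sketch: the two base draws come from the text, while the eight-hour duration is an assumption chosen for illustration.

```python
def extra_energy_wh(base_a_w, base_b_w, hours):
    """Energy difference in watt-hours caused by two different base power draws."""
    return abs(base_a_w - base_b_w) * hours


# Base draws from the text (Day 1 and Day 2); the 8-hour window is assumed.
difference_wh = extra_energy_wh(0.42, 0.23, hours=8)
```

Under these assumptions the 0.19 W gap adds up to about 1.52 Wh, a large amount relative to a typical phone battery.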


5. A hybrid approach


We advocate a hybrid approach to accurately attribute energy 
consumption to the code that uses it: accurate battery sensors 
(like the FG) to gather online power readings, and system call 
modelling to handle components with tail states. 

This approach has several benefits. First, it ensures that an 
energy profile always shows a complete picture of system energy 
consumption. By always capturing whole system power readings, 
components acting in ways unanticipated by the model do not go
unobserved. While precise attribution is challenging, the recent
applications and their call stacks will help identify problems, par- 
ticularly if applications are sampled over many days. Real mea- 
surements at the very least identify power anomalies, and that 
the model may need correction. Second, a hybrid approach deals 
with tail power states in a way measurement alone cannot. For 
example, it can accurately assign the energy usage of networking 
hardware to the code making network calls, providing a more ac- 
tionable energy profile to a developer. Finally, if we use on-board 
sensors, we can avoid the need for expensive external hardware, 
improving the accessibility of energy information to developers. 

This approach also opens a new field of potential optimization 
at the OS level. With completely online power measurements, the 
OS may monitor battery life performance at a fine-grain in real 
time, and perform optimization to achieve a battery life goal. For 
example, if an alarm is set for 8 hours in the future and projecting 
current energy consumption indicates that the battery will not last 
that long, the OS can optimize components to meet this power 
goal. This approach improves over the current practice, in which 
the OS takes drastic measures when the battery is very low and 
it is too late to recover. Guided by the component attribution 
possible with modelling, the OS may determine that the GPS is 
consuming too much energy and tune it to use less energy by 
providing less accurate positions. Gathering the data must have 
low overhead (less than 4% in our testing), such that it does not 
contribute to the problem. The OS needs fine-grained real-time 
low cost power data to make these types of optimizations. Our 
tool provides such data. 
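
The battery-life goal described above can be expressed as a simple projection from remaining energy and measured power. The sketch below is illustrative only: the remaining-energy figure is hypothetical, the function names are our own, and a real OS policy would need to account for changing workloads.

```python
def hours_remaining(remaining_wh, average_power_w):
    """Project battery lifetime from remaining energy and current average draw."""
    if average_power_w <= 0:
        raise ValueError("average power must be positive")
    return remaining_wh / average_power_w


def meets_goal(remaining_wh, average_power_w, goal_hours):
    """True if the battery is projected to last at least goal_hours."""
    return hours_remaining(remaining_wh, average_power_w) >= goal_hours


def power_budget_w(remaining_wh, goal_hours):
    """Average power the device may draw and still reach the goal."""
    return remaining_wh / goal_hours


# Hypothetical state: 3.0 Wh left, alarm set 8 hours from now.
on_track_high_draw = meets_goal(3.0, 0.42, goal_hours=8)
on_track_low_draw = meets_goal(3.0, 0.23, goal_hours=8)
budget = power_budget_w(3.0, goal_hours=8)
```

With these numbers the higher base draw misses the eight-hour goal and the lower one meets it, so the OS would need to bring average power down to the computed budget.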


6. Conclusion 


Mobile devices are placing energy efficiency in the hands of soft- 
ware developers. Unfortunately, power modelling alone cannot 
identify significant power variations, nor is it practical in deploy- 
ment. We show that on-board fine-grained power measurements 
of the fuel gauge are now both accurate and low overhead. 

We advocate a hybrid approach to power measurement: com- 
bining the best aspects of both modelling and measurement to 
produce accurate and actionable energy information for develop- 
ers. Using such tools, developers have the potential to understand 
and optimize the energy behavior of their code. Operating sys- 
tem developers have the potential to guide real-time optimization 
in the interests of battery life, a significant advancement over the 
current state of power optimization. These advances are critical to 
the future of mobile software, as developers at all levels come to 
terms with their new-found responsibility for energy efficiency. 
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The Model Is Not Enough: 


Why understand energy? 


Smart phones and tablets force software developers to focus on
battery life.


Developers already optimise performance using profilers. Our 
goal is to build an energy profiler. 


Why is energy profiling difficult? 


The two ways to calculate energy—hardware power meters and 
software modelling—have drawbacks. 


Attribution of energy to code is made difficult by tail power states, 
which shift the blame. 


Tail power states: who’s to blame? 
Figure 1 The 3G radio’s tail power state consumes energy while the CPU is idle. 
Figure 2 Tail power states avoid returning to the ramp-up 
state, therefore reducing latency. 


A deeper look: energy profiling 


Why understand energy? 

* Software and energy interact in subtle and non-trivial ways: for example, grouping of 
network requests to avoid thrashing the network hardware 

* Energy bugs present serious usability problems, because users cannot easily identify 
them until it is too late (i.e. the battery is empty) 

Why is energy profiling difficult? 

* Hardware meters are often expensive and bulky 

* Models are specific to their training environment 

* Tail power states reduce ramp-up latency by allowing components to remain powered on 
after last use 

* So it’s wrong to attribute current power draw to the currently executing code: the real 
culprit may be long gone! 


Understanding Energy Consumption in Mobile Devices 


Power modelling 
Modelling uses metrics such as CPU and network usage to 
extrapolate power draw, but this has issues for profiling use. 


Recent work uses system call tracing to handle tail power states, 
but other issues remain. 


Power measurement 
Hardware power meters are common for power measurement, 
but these are impractical for most developers. 


Our work shows the onboard "fuel gauge" battery sensor is 
accurate to within 2% of external meters, but measurement alone 
cannot address tail power states. 


Modelling misses important variation 
Figure 3 Modelling misses a base power variation, worth 30% of battery life over 8 hours. 


A deeper look: modelling and measurement 


Power modelling 
* The raw numbers are often quite good: errors of below 10% 
* But the models: 
1. Do not address tail power states 
2. Are specific to the lab environment they are trained in 
3. Cannot detect the unexplained power variation in Figure 3 
* In system call tracing, finite state machines model each component's power states, 
using system calls to transition the machine 
* This addresses the tail power state problem, but suffers the other drawbacks of 
modelling, and also cannot yet capture important components like the screen 
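
The finite state machine idea can be sketched for a single radio component. This is an illustrative model only: the three states, the 5 second tail timeout, the power figures and the use of a process id for attribution are all assumptions, not values from the work cited here.

```cpp
#include <map>

// Assumed power figures and tail timeout (illustrative, not measured).
constexpr double kActiveW = 1.0;  // power while transferring
constexpr double kTailW = 0.5;    // power while lingering in the tail state
constexpr double kTailS = 5.0;    // how long the tail state lasts

enum class RadioState { Idle, Active, Tail };

struct RadioModel {
    RadioState state = RadioState::Idle;
    int last_user = -1;              // process that last used the radio
    double tail_left_s = 0.0;        // time remaining in the tail state
    std::map<int, double> energy_j;  // energy attributed per process id

    // A traced network system call moves the machine to Active.
    void on_send(int pid) { state = RadioState::Active; last_user = pid; }

    // The transfer finished: the radio lingers in the tail state.
    void on_done() { state = RadioState::Tail; tail_left_s = kTailS; }

    // Advance time by dt seconds, charging energy to the last user of the
    // radio rather than to whatever code happens to be running now.
    void tick(double dt) {
        if (state == RadioState::Active) {
            energy_j[last_user] += kActiveW * dt;
        } else if (state == RadioState::Tail) {
            const double t = dt < tail_left_s ? dt : tail_left_s;
            energy_j[last_user] += kTailW * t;
            tail_left_s -= t;
            if (tail_left_s <= 0.0) state = RadioState::Idle;
        }
    }
};
```

The point of the sketch is the attribution rule in `tick`: tail energy is billed to the process that triggered the transfer, even if it has since exited.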
Power measurement 
* External meters are expensive and bulky, making them inaccessible and difficult to use 
* The fuel gauge typically measures battery capacity, but modern sensors provide 
instantaneous power draw 
* When testing an HTC Windows Phone, the onboard sensor was accurate to within 2% 
of total energy compared to the power meter 
* The overhead of sampling the fuel gauge at 5 Hz was less than 4%. 
* This means the fuel gauge is accurate enough to replace the external power meter for 
many uses, but may not be appropriate for very high frequency sampling 
* The power meter we used samples at 5000 Hz 
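
Turning periodic fuel gauge readings into a total energy figure is a numerical integration. A minimal sketch follows, assuming evenly spaced instantaneous power samples in watts; the function name and the choice of the trapezoidal rule are ours for illustration.

```cpp
#include <cstddef>
#include <vector>

// Energy in joules from power samples (watts) taken at a fixed rate,
// integrated with the trapezoidal rule.
double energy_joules(const std::vector<double>& power_w, double sample_hz) {
    if (power_w.size() < 2 || sample_hz <= 0.0) return 0.0;
    const double dt = 1.0 / sample_hz;  // seconds between samples
    double energy = 0.0;
    for (std::size_t i = 1; i < power_w.size(); ++i) {
        energy += 0.5 * (power_w[i - 1] + power_w[i]) * dt;
    }
    return energy;
}
```

At 5 Hz each interval is 0.2 s, so power changes shorter than that are invisible to the integration; a 5000 Hz external meter would still be needed to observe them.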
A hybrid approach 


Based on our results, we advocate a hybrid approach to energy 
profiling. 


This approach makes energy profiling more accurate, more 
actionable, and more accessible. 


What do we get from this data? 


Energy profiles let developers isolate poor energy usage in their 
code. 


Energy profiles open a new field of potential online OS-level 
energy optimisations. 


Energy profiles identify real bugs 
Figure 4 An energy bug in the MSN website, seen and diagnosed by an energy profiler. 
A deeper look: hybrid energy profiles 


A hybrid approach 

* Accurate onboard sensors for online, whole-system power readings, and system call 
modelling to handle tail power states 

* Energy profiles will always show a complete picture of system energy use, rather than 
omitting unmodelled components, making profiles more accurate 

* Tail power states are handled, making profiles more actionable 

* By removing the need for hardware, even for training, profiles are more accessible 

What do we get from this data? 

* In Figure 4 we see MSN as a clear outlier in energy use 

* The profile identifies excessive image preloading as the cause of this inefficiency 

* The OS can use online profiles to achieve battery life goals; for example, tuning energy 
use to ensure a scheduled alarm goes off 

* Guided by component attribution, this tuning can target specific energy users like the 
GPS, and trade accuracy or speed for battery life 
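
One way to picture a profile that never omits unmodelled components is to treat the measured whole-system power as the total and report whatever the model cannot explain as its own entry. This is a sketch under that assumption, not a description of our implementation; the function name and the "unattributed" label are hypothetical.

```cpp
#include <map>
#include <string>

// Combine a measured whole-system power reading (watts) with per-component
// modelled estimates. The remainder is kept visible as "unattributed"
// instead of being silently dropped.
std::map<std::string, double> hybrid_profile(
        double measured_total_w,
        const std::map<std::string, double>& modelled_w) {
    std::map<std::string, double> profile = modelled_w;
    double modelled_sum = 0.0;
    for (const auto& kv : modelled_w) modelled_sum += kv.second;
    profile["unattributed"] = measured_total_w - modelled_sum;
    return profile;
}
```

A large unattributed share would flag variation of the kind shown in Figure 3, which a model alone cannot detect.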
1. INTRODUCTION 


Multicore processors are becoming ubiquitous, placing new 
demands on hardware and software designers. No longer do 
a small set of experts develop a few software applications 
for a small number of parallel machines. Already standard 
in servers, desktops and laptops, today handheld devices 
use multicores, expanding the spectrum of their use from 
mobile computing at the low end to cloud computing at the 
high end. Consequently, a dramatically larger number of 
software developers are creating hundreds of thousands of 
applications to run on a plethora of diverse platforms. Thus 
ease of writing parallel programs, to achieve energy and/or 
performance efficiency, continues to gain importance. 

At the same time, programmers have to account for the 
changing characteristics of emerging technologies. Proces- 
sors are transitioning from homogeneous cores to hetero- 
geneous cores with disparate performance/energy character- 
istics. As future computing hardware pushes the limits of 
semiconductor technology, it will become increasingly unre- 
liable. Simultaneously, emerging use of computing systems 
will require them to host multiple applications concurrently, 
even on mobile devices. Unreliability, resource (computing 
and energy) management, and service-level agreements will 
lead to imprecise knowledge of available resources during a 
program’s execution. Hence programmers can no longer as- 
sume availability of given (or constant) resources to process 
an application, unlike in canonical parallel programming. 

The confluence of the above factors poses daunting challenges 
to programmers in writing ubiquitous programs and 
achieving their reliable, energy-efficient, parallel execution, 
while remaining agnostic of the unpredictable, dynamically 
(and potentially continuously) changing computing condi- 
tions. 

We propose a model that seamlessly addresses this range 
of challenges. It relies on expressing parallel algorithms 
as sequential programs, i.e., statically-sequential (§2.1), 
and performing their controlled, dynamic parallel execution 
while honoring their sequential semantics. Although at first 
glance the approach may appear antithetical to parallelism, 
we show that it affords several advantages. Its intuitive inter- 
face and sequentially determinate execution (which ensures 
that in any execution of a program with the same inputs, 
a variable is assigned the same sequence of values) allow 
programmers to easily reason about the program execution, 
simplifying programming. The model utilizes the implied 
order in a statically-sequential program to achieve a dataflow 
schedule of parallel execution (§2.2), potentially exploiting 
all available parallelism. Further, the order permits the adapt- 
ability needed to achieve efficient execution in dynamically 
changing (§2.3), unreliable (§2.4) computing environments. 
We provide an overview of these aspects and present results 
from our efforts to develop several benchmark applications 
using the model, implemented as a fully functional runtime 
system, on stock multicore systems. 


2. DYNAMIC PARALLELIZATION OF 
SEQUENTIAL PROGRAMS 


Our approach strives to minimize the burden on program- 
mers. It allows programs to be authored in established im- 
perative programming languages, such as C++, and auto- 
mates their parallel execution. The model extracts a pro- 
gram’s computations, establishes the dynamic data-flow be- 
tween them, and schedules their ordered execution as the 
prevailing resources permit. It can also roll back the exe- 
cution, up to a desired point, and resume it, if desired. We 
highlight the model’s principles by describing the programming 
interface and the mechanisms as implemented in the 
runtime (a C++ library). 


2.1 Composing Programs 


Programmers today follow modern software engineering and 
object-oriented (OO) design principles by composing pro- 
grams from reusable functions that manipulate encapsulated 
data and communicate with each other using well-defined in- 
terfaces. Often such “well-composed” functions avoid side- 
effects by only manipulating data communicated through the 
interfaces. We seek to exploit the properties of such OO pro- 
grams and the natural insights programmers have in their al- 
gorithms. 

Programs written using the runtime library closely resemble 
their sequential versions intended to run on a uniprocessor, 
but for a few user annotations. Users compose programs 
from computations and data structures amenable to 
concurrent execution, as they would conventional parallel 
programs. In addition, they annotate the code to identify po- 
tentially concurrent functions and the data potentially shared 
between them. They further formulate the shared data read 
and written (in the form of objects) by the functions, avail- 
able from the function’s interface, into read and write sets, 
respectively. Beyond these annotations the onus is not on 
the user to schedule execution of the computations or to ensure 
independence of concurrent computations, in contrast 
to conventional parallel programming. 


2.2 Executing Programs 


To execute a program on processing cores the runtime raises 
the granularity of computations to functions. It sequences 
through the program sequentially but seeks to execute the 
functions concurrently. Before executing a function the run- 
time establishes its dependence on already executing func- 
tions using the objects in the function’s read and write sets. 
Since objects in the read and write sets may be unknown stat- 
ically, their identity is established dynamically, at run-time, 
by dereferencing pointers. The runtime employs dataflow 
execution since it naturally exposes the innate parallelism 
between computations. Functions found to be independent 
are submitted for execution while those that are dependent 
are “shelved” until their dependences have resolved. The 
runtime continues to seek work beyond stalled computa- 
tions, resources permitting, and thus dynamically exploits 
any available parallelism. Moreover, it ensures that the ex- 
ecution proceeds as per the implied semantics that program- 
mers have come to expect from sequential programs. 
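
The dependence test between functions can be illustrated with plain set intersections. The sketch below is not the runtime's code: the `Task` type and function names are hypothetical, objects are identified by name rather than by dereferenced pointers, and it conservatively assumes that read-after-write, write-after-read and write-after-write overlaps all serialize two functions.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct Task {
    std::string name;
    std::set<std::string> wr;  // objects the function may write
    std::set<std::string> rd;  // objects the function may read
};

// True when the two sets share at least one object.
static bool intersects(const std::set<std::string>& a,
                       const std::set<std::string>& b) {
    for (const auto& x : a) {
        if (b.count(x) != 0) return true;
    }
    return false;
}

// A later task must wait for an earlier one if their sets conflict.
bool depends_on(const Task& later, const Task& earlier) {
    return intersects(later.rd, earlier.wr)    // read-after-write
        || intersects(later.wr, earlier.rd)    // write-after-read
        || intersects(later.wr, earlier.wr);   // write-after-write
}

// For each task in program order, list the earlier tasks it depends on.
// Tasks with an empty list are independent and may be submitted at once.
std::vector<std::vector<int>> build_deps(const std::vector<Task>& prog) {
    std::vector<std::vector<int>> deps(prog.size());
    for (std::size_t i = 0; i < prog.size(); ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            if (depends_on(prog[i], prog[j])) {
                deps[i].push_back(static_cast<int>(j));
            }
        }
    }
    return deps;
}
```

A real runtime would only compare against functions that are still executing or shelved, and would drop edges as functions complete; the sketch compares against all earlier tasks for simplicity.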

The runtime also provisions to handle functions (identi- 
fied by the user) which do not follow OO principles (e.g., 
with unknown side effects) by executing them sequentially. 

Statically-sequential applications (blackscholes, barneshut, 
bzip2, dedup, histogram, and reverse index) from standard 
benchmark suites were developed using the runtime and run on 
three stock multicore systems: an 8-thread Intel Nehalem-based 
machine and 16-core and 32-core AMD Opteron-based machines. 
They achieved speedups (harmonic mean) similar to their 
Pthread versions on the Nehalem machine and over 20% 
better on the AMD Opteron machines [1]. 


2.3 Time- and Energy-Efficient Execution 


Utilizing resources efficiently in dynamically changing en- 
vironments will be a key challenge going forward. Doing 
so will require exposing application parallelism that best 
fits the capabilities of resources in the execution environ- 
ment. While exposing too little parallelism can underuti- 
lize the resources, exposing excessive parallelism can lead 
to contention for resources, potentially degrading its time- 
and energy-efficiency. Dynamically matching the exposed 
parallelism with the changing capabilities of the execution 
environment requires the ability to suspend already execut- 
ing computations, reintroduce them later, and introduce new 
computations into the environment, as appropriate. The run- 
time exploits the implied ordering in statically-sequential 
programs to choose computations judiciously when regulat- 
ing the parallelism, while ensuring forward progress. It uses 
a Goodness of Parallelism (GoP) metric, computed periodi- 
cally as the execution unfolds, to correlate the instantaneous 
efficiency of the program to the instantaneous degree of par- 
allelism. A drop in efficiency causes it to throttle the par- 
allelism to ease contention, while an improvement in efficiency 
causes it to increase the parallelism to exploit available 
resources. 
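
The throttling policy can be sketched as a simple feedback loop. The text does not define how GoP is computed, so the sketch assumes only that a periodic "higher is better" efficiency score is available; the `ParallelismController` type and its one-step adjustment are hypothetical.

```cpp
#include <algorithm>

// Hypothetical controller that adjusts the degree of parallelism by one
// step per sampling period, based on the change in an efficiency score.
struct ParallelismController {
    int degree;             // current number of concurrent computations
    int max_degree;         // hardware contexts available
    double last_eff = 0.0;  // score from the previous period

    ParallelismController(int start, int max)
        : degree(start), max_degree(max) {}

    // Called once per sampling period with the latest efficiency score.
    int update(double eff) {
        if (eff < last_eff) {
            // Efficiency dropped: throttle, but never below 1 so the
            // program keeps making forward progress.
            degree = std::max(1, degree - 1);
        } else if (eff > last_eff) {
            // Efficiency improved: expose more parallelism.
            degree = std::min(max_degree, degree + 1);
        }
        last_eff = eff;
        return degree;
    }
};
```

In a statically-sequential program the computations to suspend would be the youngest in program order, which is what keeps the oldest work moving.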

Experimental results on a stock 4-core (8-thread) Intel 
Core i7-2600 (Sandy Bridge) workstation show that our 
approach achieves up to 50% higher time- and energy- 
efficiency over the state-of-the-art parallel execution systems 
across a variety of dynamic operating conditions. 


2.4 Precise-Restartable Execution 


Future computer systems will present unreliable resources to 
applications due to exception events, e.g., hardware faults, 
timing errors caused by aggressive energy management, or 
due to resource management. To be efficient it will still 
be desirable to continue executing the interrupted program, 
possibly at a different time and/or on another system, with- 
out discarding all of the completed work. Hence to resume 
execution in such scenarios the runtime supports precise- 
restartability of parallel programs, analogous to precise- 
interruptible execution of sequential programs. 

The runtime exploits the implied ordering to precisely 
identify the excepted computation in the statically-sequential 
program and restores the program state to reflect the sequen- 
tial execution of the program up to the computation. To do 
so it tracks the invocation and completion of computations in 
the implied program order. Further, it checkpoints the state 
a computation may modify, i.e., its mod set (a user-provided 
set similar to the computation’s write set and processed sim- 
ilarly) before its execution. Once the excepting condition 
is mitigated the program may resume from the excepting 
computation. The runtime also incrementally checkpoints 
the program state after each computation successfully com- 
pletes, using its mod set. This state can be used to spatially 
or temporally migrate a halted program. 
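
Checkpointing a computation's mod set and restoring it on an exception can be sketched as an undo log. This is a toy illustration, not the runtime's data structure: program state is a map from object name to integer, and the `HistoryBuffer` type, its method names and the sequence numbers are assumptions.

```cpp
#include <map>
#include <string>
#include <vector>

using State = std::map<std::string, int>;  // object name -> value (toy state)

struct HistoryBuffer {
    struct Entry {
        int seq;      // position of the computation in program order
        State saved;  // values of its mod set before it ran
    };
    std::vector<Entry> log;  // assumed to be appended in program order

    // Save the current values of the objects in mod_set before the
    // computation with sequence number seq executes.
    void checkpoint(int seq, const State& state,
                    const std::vector<std::string>& mod_set) {
        Entry e{seq, {}};
        for (const auto& name : mod_set) {
            auto it = state.find(name);
            if (it != state.end()) e.saved[name] = it->second;
        }
        log.push_back(e);
    }

    // Undo computation seq and everything after it, youngest first, so the
    // state reflects sequential execution up to (not including) seq.
    void restore(int seq, State& state) {
        while (!log.empty() && log.back().seq >= seq) {
            for (const auto& kv : log.back().saved) state[kv.first] = kv.second;
            log.pop_back();
        }
    }
};
```

The sketch does not handle objects created by a computation, and it keeps every entry; a real implementation would discard entries once the corresponding computations have completed in order.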

Experiments on a stock 12-core (24-thread) Intel Xeon 
E5-2420 (Sandy Bridge) workstation show that the runtime 
can tolerate significantly higher exception rates (proportional 
to thread count) than conventional approaches. Depending 
on the application, the support to tolerate aggressive 
exception rates (e.g., up to 2 every second) incurs performance 
overheads ranging from 0% to 135% (at 0 faults). 


3. CONCLUSION 


Parallel programming for multicore-based systems and their 
dynamically changing operating environments poses significant 
challenges to everyday programmers in the effort to 
improve productivity and to achieve error-free, efficient 
execution of their programs. We presented a model that 
meets these challenges better than other approaches by using 
statically-sequential programs and performing their dynam- 
ically controlled dataflow execution. 
Challenges in Future Computing Systems 


* Programmer Productivity 
- Simplify programming of dynamic and static 
heterogeneous multicores 
- Enable portable, platform-agnostic programming 
* Efficiency 
- Optimize energy- and performance-efficiency 
- Adapt to dynamically and continuously changing 
operating conditions 
* Reliability 
- Tolerate increasingly unreliable resources 
- Provide differentiated levels of service 


Dynamic Parallelization 


* Statically-sequential programs (using parallel 
algorithms) 

* Dynamic dataflow parallel execution 
- Preserving sequential semantics 

* Dynamically controlled parallelism 

* Implemented with a software runtime library (C++) 
- Seamlessly addresses Productivity, Parallel 
Execution, Efficiency, and Reliability 


* Leverage modern Object-Oriented principles 
- Modularity, encapsulation 
* Exploit users' insights in their algorithms 
- Users identify potentially parallel functions, data 
potentially shared between them, and their read and 
write sets 
* Users do not ensure independence between 
computations, nor orchestrate parallel execution 

    for (i = 1; i < 7; i++) {
        df_execute (&F, wrSet, rdSet);
    }

Example of Statically-Sequential Code 


Simplified parallel programming 
Parallelizing the Execution 


* Exploit Function-Level parallelism 
* Program execution unfolds sequentially 

- Functions execute concurrently (dataflow schedule) 
* Data dependences between functions established 
dynamically (using data sets) 


      wrSet     rdSet 
F1:   {B, C}    {A} 
F2:   {D}       {A} 
F3:   {A, E}    {E} 
F4:   {B}       {D} 
F5:   {B}       {D} 
F6:   {G}       {H} 


Dynamic Invocations of Example Code and Dependence Graph 


* Independent functions execute concurrently, 
dependent functions are serialized (in program order) 


* Dependences are tracked as functions execute and 
complete 
Example Dataflow Execution Schedule on 3 Cores 


Harmonic Mean of Achieved Speedups 

[Bar chart: speedup (y-axis) of the Pthread and Dataflow versions on three machines: 8x Nehalem (Core i7-965), 16x Barcelona (Opteron 8350), 32x Barcelona (Opteron 8356)] 

Applications: Barneshut, Blackscholes, Bzip2, Dedup, Histogram, ReverseIndex 

+ Up to 20% higher speedups than conventional 
implementations 


Benefits of Order and Dataflow Execution 


* Dynamically establish dependences to honor 

* Sequentially determinate, predictable and 
repeatable execution 

* Freedom from deadlocks; guaranteed forward 
progress 

* Arbitrary control of execution to optimize efficiency 


* Precise-restartability of halted computations 


Efficient Execution 


* "Goodness of Parallelism" metric to assess 
instantaneous efficiency 
- Measured periodically 


* Adapt degree-of-parallelism to resource contention 


— Optimize for time- and energy-efficiency 
Dynamically Adaptive Execution: F6 Delayed to Improve Efficiency 


Improving Time- and Energy-Efficiency 
On Intel 4-core, 8-thread, 3.4GHz Sandy Bridge 


+ Up to 50% more time- and energy-efficient than state-of-the-art 
parallel execution systems 


Precise-Restartable Execution 


* Order of executing computations tracked using a 
Reorder List 

* Functions “retired” in program order 

* Computation state checkpointed in History Buffer 


- Restored on exception, if needed 
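
In-order retirement can be sketched with a queue that only releases completed computations from its head. The poster does not describe the data structure's interface, so the `ReorderList` type, its method names and the integer sequence numbers below are assumptions for illustration.

```cpp
#include <deque>

// Hypothetical reorder list: computations enter in program order, may
// complete in any order, and retire strictly from the oldest end.
struct ReorderList {
    struct Slot {
        int seq;    // position in program order
        bool done;  // has the computation completed?
    };
    std::deque<Slot> slots;  // oldest computation at the front

    // Record a computation when it is invoked, in program order.
    void invoke(int seq) { slots.push_back({seq, false}); }

    // Mark a computation complete; completion may be out of order.
    void complete(int seq) {
        for (auto& s : slots) {
            if (s.seq == seq) s.done = true;
        }
    }

    // Retire finished computations from the head only; returns how many
    // were retired. A finished computation behind an unfinished one waits.
    int retire() {
        int n = 0;
        while (!slots.empty() && slots.front().done) {
            slots.pop_front();
            ++n;
        }
        return n;
    }
};
```

After an exception, the head of the list identifies the oldest unretired computation, which is the point up to which checkpointed state would need to be kept.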
Precise-Restartable Execution 
* Tolerates significantly higher fault rates (proportional 
to thread-count) than conventional methods 


* Incurs 0% to 135% performance overhead (at 0 faults) 


Program Execution on Future Multicores 


Dynamically controlled, dataflow execution of 
statically-sequential programs 

+ Enables simplified, platform-independent 
programming 

+ Automates platform-specific, dynamic and 
continuous optimizations for energy- and 
performance -efficiency 

+ Tolerates high fault rates and manages resources 
at low overheads 


